100+ datasets found
  1. d

    Bioinformatics Links Directory

    • dknet.org
    • test2.scicrunch.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). Bioinformatics Links Directory [Dataset]. http://identifiers.org/RRID:SCR_008018
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    Database of curated links to molecular resources, tools and databases selected on the basis of recommendations from bioinformatics experts in the field. This resource relies on input from its community of bioinformatics users for suggestions. Since 2003, it has also listed all links contained in the NAR Webserver issue. The different types of information available in this portal:

    • Computer Related: This category contains links to resources relating to programming languages often used in bioinformatics. Other tools of the trade, such as web development and database resources, are also included here.
    • Sequence Comparison: Tools and resources for the comparison of sequences including sequence similarity searching, alignment tools, and general comparative genomics resources.
    • DNA: This category contains links to useful resources for DNA sequence analyses such as tools for comparative sequence analysis and sequence assembly. Links to programs for sequence manipulation, primer design, and sequence retrieval and submission are also listed here.
    • Education: Links to information about the techniques, materials, people, places, and events of the greater bioinformatics community. Included are current news headlines, literature sources, educational material and links to bioinformatics courses and workshops.
    • Expression: Links to tools for predicting the expression, alternative splicing, and regulation of a gene sequence are found here. This section also contains links to databases, methods, and analysis tools for protein expression, SAGE, EST, and microarray data.
    • Human Genome: This section contains links to draft annotations of the human genome in addition to resources for sequence polymorphisms and genomics. Also included are links related to ethical discussions surrounding the study of the human genome.
    • Literature: Links to resources related to published literature, including tools to search for articles and through literature abstracts. Additional text mining resources, open access resources, and literature goldmines are also listed.
    • Model Organisms: Included in this category are links to resources for various model organisms ranging from mammals to microbes. These include databases and tools for genome scale analyses.
    • Other Molecules: Bioinformatics tools related to molecules other than DNA, RNA, and protein. This category will include resources for the bioinformatics of small molecules as well as for other biopolymers including carbohydrates and metabolites.
    • Protein: This category contains links to useful resources for protein sequence and structure analyses. Resources for phylogenetic analyses, prediction of protein features, and analyses of interactions are also found here.
    • RNA: Resources include links to sequence retrieval programs, structure prediction and visualization tools, motif search programs, and information on various functional RNAs.

  2. e

    Data from: PROSITE

    • prosite.expasy.org
    • the-mouth.com
    • +7more
    Updated Jun 18, 2025
    Cite
    (2025). PROSITE [Dataset]. https://prosite.expasy.org/
    Explore at:
    Dataset updated
    Jun 18, 2025
    Description

    PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

  3. q

    DNA Detective: Genotype to Phenotype. A Bioinformatics Workshop for Middle...

    • qubeshub.org
    Updated Aug 29, 2021
    Cite
    Anne Sternberger*; Sarah Wyatt (2021). DNA Detective: Genotype to Phenotype. A Bioinformatics Workshop for Middle School to College. [Dataset]. http://doi.org/10.24918/cs.2019.34
    Explore at:
    Dataset updated
    Aug 29, 2021
    Dataset provided by
    QUBES
    Authors
    Anne Sternberger*; Sarah Wyatt
    Description

    Advances in high-throughput techniques have resulted in a rising demand for scientists with basic bioinformatics skills as well as workshops and curricula that teach students bioinformatics concepts. DNA Detective is a workshop we designed to introduce students to big data and bioinformatics using CyVerse and the Dolan DNA Learning Center's online DNA Subway platform. DNA Subway is a user-friendly workspace for genome analysis and uses the metaphor of a network of subway lines to familiarize users with the steps involved in annotating and comparing DNA sequences. For DNA Detective, we use the DNA Subway Red Line to guide students through analyzing a "mystery" DNA sequence to determine its gene structure and name. During the workshop, each student is assigned a unique Arabidopsis thaliana DNA sequence. Students "travel" the Red Line to computationally find and remove sequence repeats, use gene prediction software to identify structural elements of the sequence, search databases of known genes to determine the identity of their mystery sequence, and synthesize these results into a model of their gene. Next, students use The Arabidopsis Information Resource (TAIR) to identify their gene's function so they can hypothesize what a mutant plant lacking that gene might look like (its phenotype). Then, from a group of plants in the room, students select the plant they think is most likely defective for their gene. Through this workshop, students are acquainted with the flow of genetic information from genotype to phenotype and tackle complex genomics analyses, in hopes of inspiring and empowering them towards continued science education.

  4. f

    The percentage identities and similarities of CSD between CfCSP and other...

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Chuanyan Yang; Lingling Wang; Vinu S. Siva; Xiaowei Shi; Qiufen Jiang; Jingjing Wang; Huan Zhang; Linsheng Song (2023). The percentage identities and similarities of CSD between CfCSP and other CSD containing proteins. [Dataset]. http://doi.org/10.1371/journal.pone.0032012.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Chuanyan Yang; Lingling Wang; Vinu S. Siva; Xiaowei Shi; Qiufen Jiang; Jingjing Wang; Huan Zhang; Linsheng Song
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    I%: identity, calculated as the percentage of identical amino acids per position in alignments; S%: similarity, calculated as the percentage of identical plus similar residues. I% and S% were analyzed using the Ident and Sim Analysis provided on http://www.bioinformatics.org/sms/.
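    The I% and S% definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the Ident and Sim tool: the similarity groups, the choice to skip gapped columns, and the function name `identity_similarity` are all assumptions made for the example.

```python
# Hypothetical similarity groups; residues in the same group count as "similar".
SIMILAR_GROUPS = ["GAVLI", "FYW", "CM", "ST", "KRH", "DENQ", "P"]


def identity_similarity(aligned_a, aligned_b, groups=SIMILAR_GROUPS):
    """Return (I%, S%) for two aligned sequences of equal length.

    Assumption: columns containing a gap ("-") in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Map each residue to the index of its similarity group.
    group_of = {res: i for i, group in enumerate(groups) for res in group}

    compared = identical = similar = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            identical += 1
        elif x in group_of and group_of[x] == group_of.get(y):
            similar += 1

    if compared == 0:
        return 0.0, 0.0
    identity = 100.0 * identical / compared
    similarity = 100.0 * (identical + similar) / compared
    return identity, similarity
```

    S% counts identical plus similar residues, so it is never lower than I% for the same alignment.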

  5. e

    PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  6. d

    Data from: Differential hippocampal gene expression is associated with...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Oct 31, 2012
    Cite
    Vladimir V. Pravosudov; Timothy C. Roth II; Matthew L. Forister; Lara D. LaDage; Robin Kramer; Faye Schilkey; Alexander M. van der Linden; T. C. Roth (2012). Differential hippocampal gene expression is associated with climate-related natural variation in memory and the hippocampus in food-caching chickadees [Dataset]. http://doi.org/10.5061/dryad.dg237
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 31, 2012
    Dataset provided by
    Dryad
    Authors
    Vladimir V. Pravosudov; Timothy C. Roth II; Matthew L. Forister; Lara D. LaDage; Robin Kramer; Faye Schilkey; Alexander M. van der Linden; T. C. Roth
    Time period covered
    2012
    Area covered
    Manhattan Kansas, Anchorage Alaska
    Description

    assembled transcript sequences: Final contigs from the assembly, minimum 100 bp. The names correspond to the specific transcript builds, where contig_id is the contig identifier, for example, avdl_parai-20111027|1234. Possibly includes UTRs. Sequences contain IUPAC ambiguity codes representing ambiguous bases (http://www.bioinformatics.org/sms/iupac.html).

    predicted protein sequences: Protein products predicted by ESTScan, minimum 30 aa. Sequence identifiers for these predicted products correspond to the associated nucleotide sequence in file assembled transcript sequences.txt, and are provided suffixes _0, _1, etc., to accommodate multiple predictions.

  7. Data from: Highly significant improvement of protein sequence alignments...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, png +1
    Updated Jul 16, 2024
    Cite
    Athanasios Baltzis; Leila Mansouri; Suzanne Jin; Björn E. Langer; Ionas Erb; Cedric Notredame (2024). Highly significant improvement of protein sequence alignments with AlphaFold2 [Dataset]. http://doi.org/10.5281/zenodo.7031286
    Explore at:
    Available download formats: tsv, png, application/gzip
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Athanasios Baltzis; Leila Mansouri; Suzanne Jin; Björn E. Langer; Ionas Erb; Cedric Notredame
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data, figures and tables from the manuscript "Highly significant improvement of protein sequence alignments with AlphaFold2" (https://doi.org/10.1093/bioinformatics/btac625).

    The repository containing all the steps to replicate the analysis is available at GitHub (https://github.com/cbcrg/msa-af2-nf).

    *The authors Athanasios Baltzis and Leila Mansouri contributed equally.

  8. Genomics England - Bioinformatics

    • healthdatagateway.org
    unknown
    Updated Mar 30, 2023
    Cite
    The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy. (2023). Genomics England - Bioinformatics [Dataset]. https://healthdatagateway.org/dataset/381
    Explore at:
    Available download formats: unknown
    Dataset updated
    Mar 30, 2023
    Dataset provided by
    Genomics England
    Authors
    The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy.
    License

    https://www.genomicsengland.co.uk/about-gecip/joining-research-community/

    Description

    To identify and enrol participants for the 100,000 Genomes Project we have created NHS Genomic Medicine Centres (GMCs). Each centre includes several NHS Trusts and hospitals. GMCs recruit and consent patients. They then provide DNA samples and clinical information for analysis.

    Illumina, a biotechnology company, have been commissioned to sequence the DNA of participants. They return the whole genome sequences to Genomics England. We have created a secure, monitored, infrastructure to store the genome sequences and clinical data. The data is analysed within this infrastructure and any important findings, like a diagnosis, are passed back to the patient’s doctor.

    To help make sure that the project brings benefits for people who take part, we have created the Genomics England Clinical Interpretation Partnership (GeCIP). GeCIP brings together funders, researchers, NHS teams and trainees. They will analyse the data – to help ensure benefits for patients and an increased understanding of genomics. The data will also be used for medical and scientific research. This could be research into diagnosing, understanding or treating disease.

    To learn more about how we work you can read the 100,000 Genomes Project protocol. It has details of the development, delivery and operation of the project. It also sets out the patient and clinical benefit, scientific and transformational objectives, the implementation strategy and the ethical and governance frameworks.

  9. INSDC Environment Sample Sequences

    • gbif.org
    • researchdata.edu.au
    Updated Jul 12, 2025
    + more versions
    Cite
    European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
    Explore at:
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
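
    The duplicate filter described in step 6 can be sketched as a group-and-threshold pass. This is an illustrative Python sketch, not the embl-adapter implementation: it assumes records are plain dictionaries keyed by the field names listed in step 6, and the function name `drop_large_groups` is hypothetical.

```python
from collections import defaultdict

# Grouping fields named in step 6; a missing field is treated as None.
GROUP_KEYS = (
    "scientific_name",
    "collection_date",
    "location",
    "country",
    "identified_by",
    "collected_by",
    "sample_accession",
)


def drop_large_groups(records, max_group_size=50):
    """Keep only records whose group has at most `max_group_size` members."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in GROUP_KEYS)
        groups[key].append(record)

    kept = []
    for members in groups.values():
        # Groups with more than the threshold are treated as likely duplicates.
        if len(members) <= max_group_size:
            kept.extend(members)
    return kept
```

    A group of exactly 50 records is kept under this reading, since the text excludes groups with more than 50.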

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

  10. Data from: Ensembl TSS dataset for GRCh38

    • zenodo.org
    • portalcienciaytecnologia.jcyl.es
    • +2more
    bin
    Updated Aug 26, 2024
    Cite
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough; the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.

    First, we need an interface in order to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks will form our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, 700 bases before this codon and 300 bases after it are concatenated, getting the final sequence of 1003 nucleotides that is going to be used in our models. These specific window values have been used in (Bhandari et al., 2021) and we have kept them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. In order to get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected the specific position, we take 700 bases before and 300 bases after it, as we did with the positive instances.

    Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
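
    The chromosome-based split can be sketched as a simple lookup. This is an illustrative Python sketch, not the repository's code: it assumes every instance from a listed chromosome goes to that split (the description says samples were "selected" from the test chromosomes, so the real test set may be a subset), and the function name `assign_split` is hypothetical.

```python
VALIDATION_CHROMOSOMES = {"16"}
TEST_CHROMOSOMES = {"1", "3", "13", "19", "21"}


def assign_split(chromosome):
    """Map a chromosome name such as "16" or "chr16" to train/validation/test."""
    name = str(chromosome)
    if name.lower().startswith("chr"):
        name = name[3:]  # accept UCSC-style names as well
    if name in VALIDATION_CHROMOSOMES:
        return "validation"
    if name in TEST_CHROMOSOMES:
        return "test"
    return "train"
```

    Splitting by chromosome keeps every window from a given chromosome in a single partition.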

  11. Protein Secondary Structure

    • kaggle.com
    zip
    Updated Jun 6, 2018
    Cite
    -_- (2018). Protein Secondary Structure [Dataset]. https://www.kaggle.com/alfrandom/protein-secondary-structure
    Explore at:
    Available download formats: zip (40687706 bytes)
    Dataset updated
    Jun 6, 2018
    Authors
    -_-
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Introduction

    Protein secondary structure can be calculated based on its atoms' 3D coordinates once the protein's 3D structure is solved using X-ray crystallography or NMR. Commonly, DSSP is the tool used for calculating the secondary structure and assigns one of the following secondary structure types (https://swift.cmbi.umcn.nl/gv/dssp/index.html) to every amino acid in a protein:

    1. C: Loops and irregular elements (corresponding to the blank characters output by DSSP)
    2. E: β-strand
    3. H: α-helix
    4. B: β-bridge
    5. G: 3-helix
    6. I: π-helix
    7. T: Turn
    8. S: Bend

    However, X-ray crystallography and NMR are expensive. Ideally, we would like to predict the secondary structure of a protein directly from its primary sequence, a problem with a long history. A recent review of this topic is "Sixty-five years of the long march in protein secondary structure prediction: the final stretch?".

    For the purpose of secondary structure prediction, it is common to simplify the aforementioned eight states (Q8) into three (Q3) by merging (E, B) into E, (H, G, I) into H, and (C, S, T) into C. The current accuracy for three-state (Q3) secondary structure prediction is about 85%, while that for eight-state (Q8) prediction is below 70%. The exact number depends on the particular test dataset used.
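
    The Q8 to Q3 reduction can be written as a character translation table. This is a minimal Python sketch assuming the conventional mapping in which E and B become E; H, G and I become H; and C, S and T become C. The function name `q8_to_q3` is hypothetical.

```python
# Each Q8 state on the left maps to the Q3 state in the same position on the right.
Q8_TO_Q3 = str.maketrans("CEHBGITS", "CEHEHHCC")


def q8_to_q3(sst8):
    """Collapse an eight-state secondary structure string into three states."""
    return sst8.translate(Q8_TO_Q3)
```

    Applied to the `sst8` column of a record, this should give a string of the same length using only C, E and H.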

    Dataset

    The main dataset lists peptide sequences and their corresponding secondary structures. It is a transformation of https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz, downloaded on 2018-06-06 from RCSB PDB, into a tabular structure. If you download the file at a later time, the number of sequences in it will probably have increased.

    Description of columns:

    1. pdb_id: the id used to locate its entry on https://www.rcsb.org/
    2. chain_code: when a protein consists of multiple peptides (chains), the chain code is needed to locate a particular one.
    3. seq: the sequence of the peptide
    4. sst8: the eight-state (Q8) secondary structure
    5. sst3: the three-state (Q3) secondary structure
    6. len: the length of the peptide
    7. has_nonstd_aa: whether the peptide contains nonstandard amino acids (B, O, U, X, or Z).

    Key steps in the transformation:

    • Both Q3 and Q8 secondary structure sequences are listed.
    • All nonstandard amino acids, which include B, O, U, X, and Z, are masked with the "*" character.
    • An additional column (has_nonstd_aa) is added to indicate whether the protein sequence contains nonstandard amino acids.
    • A subset of the sequences with low sequence identity and high resolution, ready for training, is also provided.
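
    The masking step and the `has_nonstd_aa` flag can be sketched together. This is an illustrative Python sketch, not the curation code from the linked repository: it assumes upper-case one-letter sequences, and the function name `mask_nonstandard` is hypothetical.

```python
import re

# One-letter codes treated as nonstandard in the dataset description.
NONSTANDARD_PATTERN = re.compile("[BOUXZ]")


def mask_nonstandard(seq):
    """Return (masked_sequence, has_nonstd_aa) for an upper-case peptide string."""
    masked = NONSTANDARD_PATTERN.sub("*", seq)
    # If anything changed, the sequence contained a nonstandard residue.
    return masked, masked != seq
```

    Masking preserves sequence length, so the masked `seq` stays aligned position by position with `sst8` and `sst3`.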

    For details of curation, please see https://github.com/zyxue/pdb-secondary-structure.

    A subset (9079 sequences) based on sequences culled by PISCES with stricter quality control is also provided. This dataset is considered ready for training models.

    The culled subset generated on 2018-05-31 with cutoffs of 25%, 2Å, and 0.25 for sequence identity, resolution and R-factor respectively, is used. The URL to the original culled list is http://dunbrack.fccc.edu/Guoli/culledpdb_hh/cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz, but it may not be permanently available. This dataset contains more columns from cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz with self-explanatory names.

    For more about PISCES, please see https://academic.oup.com/bioinformatics/article/19/12/1589/258419.

    Acknowledgements

    The peptide sequence and secondary structure are downloaded from https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz. The culled subset is downloaded from http://dunbrack.fccc.edu/PISCES.php.

    Inspiration

    Kaggle provides a great platform for sharing ideas and solving data science problems. Sharing a cleaned dataset helps prevent others from duplicating work and also provides a common dataset for more comparable benchmarks among different methods.

    Early attempts on this (or related) problem:

    1. Baldi, Pierre, Søren Brunak, Paolo Frasconi, Gianluca Pollastri and Giovanni Soda. “Bidirectional Dynamics for Protein Secondary Structure Prediction.” Sequence Learning (2001). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf
    2. Chen, J. and Chaudhari, N. S.. "Protein Secondary Structure Prediction with bidirectional LSTM networks." Paper presented at the meeting of the Post-Conference Workshop on Computational Intelligence Approaches for the Analysis of Bio-data (CI-BIO), Montreal, Canada, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf (Couldn't find a pdf)
    3. Sepp Hochreiter, Martin Heusel, Klaus Obermayer; Fast model-based protein homology detection without alignment, Bioinformatics, Volume 23, Issue 14, 15 July 2007, Pages 1728–1736, https://doi.org/10.1093/bioinformatics/btm247
  12. o

    WORKSHOP: Introduction to Metabarcoding using QIIME2

    • explore.openaire.eu
    Updated Feb 22, 2022
    Cite
    Ashley Dungan; Gayle Philip; Andrew Perry; Rania Ismail; Laura Geissler; Kshitij Tandon; Igor Makunin (2022). WORKSHOP: Introduction to Metabarcoding using QIIME2 [Dataset]. http://doi.org/10.5281/zenodo.6350807
    Explore at:
    Dataset updated
    Feb 22, 2022
    Authors
    Ashley Dungan; Gayle Philip; Andrew Perry; Rania Ismail; Laura Geissler; Kshitij Tandon; Igor Makunin
    Description

    This record includes training materials associated with the Australian BioCommons workshop “Introduction to Metabarcoding using QIIME2”. This workshop took place on 22 February 2022.

    Event description: Metabarcoding has revolutionised the study of biodiversity science. By combining DNA taxonomy with high-throughput DNA sequencing, it offers the potential to observe a larger diversity in the taxa within a single sample, rapidly expanding the scope of microbial analysis and generating high-quality biodiversity data. This workshop will introduce the topic of metabarcoding and how you can use Qiime2 to analyse 16S data and gain simultaneous identification of all taxa within a sample. Qiime2 is a popular tool used to perform powerful microbiome analysis that can transform your raw data into publication quality visuals and statistics. In this workshop, using example 16S data from the shallow-water marine anemone E. diaphana, you will learn how to use this pipeline to run essential steps in microbial analysis including generating taxonomic assignments and phylogenetic trees, and performing both alpha- and beta-diversity analysis. Materials are shared under a Creative Commons Attribution 4.0 International agreement unless otherwise specified and were current at the time of the event.

    Files and materials included in this record: Event metadata (PDF): Information about the event including description, event URL, learning objectives, prerequisites, technical requirements etc. Index of training materials (PDF): List and description of all materials associated with this event including the name, format, location and a brief description of each file. Schedule (PDF): A breakdown of the topics and timings for the workshop.

    Materials shared elsewhere: This workshop follows the tutorial “Introduction to metabarcoding with QIIME2” which has been made publicly available by Melbourne Bioinformatics. https://www.melbournebioinformatics.org.au/tutorials/tutorials/qiime2/qiime2/

  13. r

    Alternative Splicing Annotation Project II Database

    • rrid.site
    • neuinfo.org
    • +3more
    Updated Jun 26, 2025
    Cite
    (2025). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
    Explore at:
    Dataset updated
    Jun 26, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis and splicing variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and lists of tissue and cancer (normal) specific alternatively spliced genes were recalculated and updated. The authors have created novel orthologous exon and intron databases and their splice variants based on multiple alignment among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein similarity based methods. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rate for skipped exons are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.

  14. Bioinformatics: An Interactive Introduction to NCBI

    • qubeshub.org
    Updated Jan 3, 2019
    Cite
    Seth Bordenstein (2019). Bioinformatics: An Interactive Introduction to NCBI [Dataset]. http://doi.org/10.25334/Q4915C
    Explore at:
    Dataset updated
    Jan 3, 2019
    Dataset provided by
    QUBES
    Authors
    Seth Bordenstein
    Description

    Modules showing how the NCBI database classifies and organizes information on DNA sequences, evolutionary relationships, and scientific publications, plus a module on identifying a nucleotide sequence from an insect endosymbiont using BLAST.

  15. RBG structural bioinformatics general files

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Levine, Tim P (2023). RBG structural bioinformatics general files [Dataset]. http://doi.org/10.7910/DVN/VSD9SS
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Levine, Tim P
    Description

    Many different files for the project, including:
    • PDB files with CONSURF with Conservation Scores in the tempFactor field (low numbers = high conservation)
    • protein sequences (2 folders, all formatted for DNA Strider 1.5a12)
    • DALI results for domains predicted by ColabFold
    • trees - phylogeny work in iTOL and Phyml (not included in paper)
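
    The tempFactor convention mentioned above can be read back with a few lines of Python. This is only a minimal sketch, assuming the files follow the standard fixed-column PDB ATOM record layout (tempFactor in columns 61-66). The function name `read_ca_scores` and the choice to take one score per C-alpha atom are illustrative assumptions, not something the dataset specifies.

```python
def read_ca_scores(pdb_lines):
    """Return {(chain_id, residue_number): tempFactor} for C-alpha atoms.

    Assumes standard fixed-width PDB ATOM records. In these files the
    tempFactor column is described as holding a conservation score
    (low numbers = high conservation) rather than a B-factor.
    """
    scores = {}
    for line in pdb_lines:
        # Columns 13-16 hold the atom name; keep one value per residue (CA).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]               # column 22
            res_seq = int(line[22:26])        # columns 23-26
            temp_factor = float(line[60:66])  # columns 61-66
            scores[(chain_id, res_seq)] = temp_factor
    return scores
```

    Given an open file handle, `read_ca_scores(handle)` would return one score per residue, which could then be sorted to list the most conserved positions first.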

  16. Bayesian NrdA phylogeny

    • su.figshare.com
    • researchdata.se
    Updated May 31, 2023
    Cite
    Daniel Lundin (2023). Bayesian NrdA phylogeny [Dataset]. http://doi.org/10.17045/sthlmuni.11558187.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Stockholm University
    Authors
    Daniel Lundin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bayesian phylogeny of NrdA, class I ribonucleotide reductase catalytic component. Sequences from NCBI's RefSeq and GenBank databases (Haft et al. 2018; https://doi.org/10.1093/nar/gkx1068), downloaded March 2019, were searched with subclass-specific HMMER (Eddy 2011; https://doi.org/10.1371/journal.pcbi.1002195) profiles for NrdA and for NrdJ, the class II RNR serving as outgroup (Lundin et al. in preparation). The resulting sequences were clustered at 60% identity with UCLUST (Edgar 2010; https://doi.org/10.1093/bioinformatics/btq461) to create a representative set of sequences. After manual inspection of sequences, 342 out of 27821 original NrdA sequences remained, plus 26 NrdJ sequences selected for aligning well to NrdA. The sequences were aligned with ProbCons (Do et al. 2005; https://doi.org/10.1101/gr.2821705) and 283 reliably aligned positions were selected with BMGE (Criscuolo & Gribaldo 2010; https://doi.org/10.1186/1471-2148-10-210) using the BLOSUM30 matrix. The alignment file is NrdA_uc0.60.NrdJ_uc0.30_outgroup.intr.correct.nolb.co.profile.BLOSUM30.bmge.mb.nxs. A Bayesian phylogeny was estimated with MrBayes v. 3.2.6 (Ronquist & Huelsenbeck 2003; https://doi.org/10.1093/bioinformatics/btg180; https://github.com/NBISweden/MrBayes) using a gamma distribution for rate variation and rjMCMC to jump between amino acid models. MrBayes was run with four chains and five runs until the average standard deviation of split frequencies reached 0.015. (See NrdA_uc0.60.NrdJ_uc0.30_outgroup.intr.correct.nolb.co.profile.BLOSUM30.bmge.mb.) The phylogeny, in Dendroscope (Huson et al. 2007; https://doi.org/10.1186/1471-2105-8-460) nexml format, is NrdA_uc0.60.NrdJ_uc0.30_outgroup.intr.correct.nolb.co.profile.BLOSUM30.bmge.mb.con.fullname.nexml.

  17. k-Word matches: an alignment-free sequence comparison method

    • researchdata.edu.au
    Updated May 5, 2022
    Cite
    Conrad J. Burden; Sylvain Forêt; Susan R. Wilson (2022). k-Word matches: an alignment-free sequence comparison method [Dataset]. http://doi.org/10.4225/03/5a1372cde0ad8
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Conrad J. Burden; Sylvain Forêt; Susan R. Wilson
    Description

    k-word matches, the number of words of length k shared between two sequences, also known as the D2 statistic, are used as an alignment-free sequence comparison statistic. The advantages of this statistic over alignment-based methods for nucleotide and amino-acid sequence comparisons are, firstly, that it does not assume that homologous segments are contiguous and, secondly, that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. We summarise our results to date on determining the distributional properties of the D2 statistic for a range of biologically relevant parameters and outline the directions in which the research will proceed. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
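
    As an illustration of the statistic described above, the following minimal Python sketch counts k-word matches between two sequences as the sum, over every word w of length k, of (occurrences of w in the first sequence) × (occurrences of w in the second). The function names `kword_counts` and `d2_statistic` are hypothetical, and this is a plain reading of the definition rather than the authors' implementation.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Each pair of positions (i, j) where the k-word starting at i in seq_a
    equals the k-word starting at j in seq_b contributes one match.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # A Counter returns 0 for words that never occur in seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

    Both word tables are built in a single pass over each sequence, which is consistent with the runtime being proportional to sequence length; no alignment of the two sequences is ever computed.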

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology; Chetty, Madhu; Ahmad, Shandar; Ngom, Alioune; Teng, Shyh Wei; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia)

    Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  18. INSDC Host Organism Sequences

    • gbif.org
    • researchdata.edu.au
    Updated Jul 12, 2025
    Cite
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Host Organism Sequences [Dataset]. http://doi.org/10.15468/e97kmy
    Explore at:
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically from the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by the methods described below.

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in STEP 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
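
    Two of the filtering steps above (steps 4 and 6) can be sketched in plain Python. This is an illustration under stated assumptions, not the embl-adapter code: records are assumed to be dictionaries, the grouping field names are taken from step 6, and the `specimen_voucher` key and both function names are hypothetical.

```python
from collections import defaultdict

# Grouping fields listed in step 6.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)
MAX_GROUP_SIZE = 50  # threshold stated in step 6


def has_required_info(record):
    """Step 4: keep a record with a voucher, or with both location AND date."""
    if record.get("specimen_voucher"):  # hypothetical field name
        return True
    return bool(record.get("location")) and bool(record.get("collection_date"))


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: group by GROUP_KEYS and drop groups with more than max_size records."""
    groups = defaultdict(list)
    for record in records:
        # Missing fields (e.g. no sample accession) are grouped under None.
        key = tuple(record.get(name) for name in GROUP_KEYS)
        groups[key].append(record)
    kept = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept
```

    A pipeline following the listed order would apply `has_required_info` as a filter first and then pass the survivors to `drop_large_groups`; groups of exactly 50 records are kept, since only groups with more than 50 are excluded.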

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

  19. Whole genome DNA sequences of Gulf of Mexico invertebrates

    • search.dataone.org
    • data.griidc.org
    Updated Feb 5, 2025
    Cite
    Thomas, W. Kelley (2025). Whole genome DNA sequences of Gulf of Mexico invertebrates [Dataset]. http://doi.org/10.7266/n7-pchj-dh15
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    GRIIDC
    Authors
    Thomas, W. Kelley
    Area covered
    Gulf of Mexico (Gulf of America)
    Description

    The dataset consists of whole genome DNA sequences generated from invertebrate species from the Gulf of Mexico during the Benthic Invertebrate Taxonomy, Metagenomics, and Bioinformatics Workshop (BITMaB) in 2017 in Corpus Christi, Texas, USA. All genomic data sets were deposited in and distributed by GenBank (NCBI), the European Nucleotide Archive (ENA) - European Bioinformatics Institute (EMBL-EBI), DNA Data Bank of Japan, NemATOL, the Global Genome Initiative, and Ocean Genome Legacy.

  20. Raw motif mapping bedfile data and model training set class probabilities

    • search.dataone.org
    • datadryad.org
    Updated May 6, 2025
    Cite
    Phillip Davis (2025). Raw motif mapping bedfile data and model training set class probabilities [Dataset]. http://doi.org/10.5061/dryad.tdz08kq3w
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Phillip Davis
    Time period covered
    Jan 1, 2023
    Description

    Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...

Cite
(2022). Bioinformatics Links Directory [Dataset]. http://identifiers.org/RRID:SCR_008018

Bioinformatics Links Directory

RRID:SCR_008018, nif-0000-10170, Bioinformatics Links Directory (RRID:SCR_008018), Bioinformatics Links Directory, Canadian Bioinformatics.ca Links Directory, Bioinformatics.ca Links Directory

Explore at:
189 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 29, 2022
Description

Database of curated links to molecular resources, tools and databases selected on the basis of recommendations from bioinformatics experts in the field. This resource relies on input from its community of bioinformatics users for suggestions. Starting in 2003, it has also started listing all links contained in the NAR Webserver issue. The different types of information available in this portal:

* Computer Related: This category contains links to resources relating to programming languages often used in bioinformatics. Other tools of the trade, such as web development and database resources, are also included here.
* Sequence Comparison: Tools and resources for the comparison of sequences including sequence similarity searching, alignment tools, and general comparative genomics resources.
* DNA: This category contains links to useful resources for DNA sequence analyses such as tools for comparative sequence analysis and sequence assembly. Links to programs for sequence manipulation, primer design, and sequence retrieval and submission are also listed here.
* Education: Links to information about the techniques, materials, people, places, and events of the greater bioinformatics community. Included are current news headlines, literature sources, educational material and links to bioinformatics courses and workshops.
* Expression: Links to tools for predicting the expression, alternative splicing, and regulation of a gene sequence are found here. This section also contains links to databases, methods, and analysis tools for protein expression, SAGE, EST, and microarray data.
* Human Genome: This section contains links to draft annotations of the human genome in addition to resources for sequence polymorphisms and genomics. Also included are links related to ethical discussions surrounding the study of the human genome.
* Literature: Links to resources related to published literature, including tools to search for articles and through literature abstracts. Additional text mining resources, open access resources, and literature goldmines are also listed.
* Model Organisms: Included in this category are links to resources for various model organisms ranging from mammals to microbes. These include databases and tools for genome scale analyses.
* Other Molecules: Bioinformatics tools related to molecules other than DNA, RNA, and protein. This category will include resources for the bioinformatics of small molecules as well as for other biopolymers including carbohydrates and metabolites.
* Protein: This category contains links to useful resources for protein sequence and structure analyses. Resources for phylogenetic analyses, prediction of protein features, and analyses of interactions are also found here.
* RNA: Resources include links to sequence retrieval programs, structure prediction and visualization tools, motif search programs, and information on various functional RNAs.
