100+ datasets found

Bioinformatics Links Directory
dknet.org
test2.scicrunch.org
+2more
Updated Jan 29, 2022
Cite
(2022). Bioinformatics Links Directory [Dataset]. http://identifiers.org/RRID:SCR_008018
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008018
Dataset updated
Jan 29, 2022
Description
Database of curated links to molecular resources, tools and databases, selected on the basis of recommendations from bioinformatics experts in the field. This resource relies on input from its community of bioinformatics users for suggestions. Since 2003, it has also listed all links contained in the NAR Webserver issue. The portal organizes its links into the following categories:
* Computer Related: This category contains links to resources relating to programming languages often used in bioinformatics. Other tools of the trade, such as web development and database resources, are also included here.
* Sequence Comparison: Tools and resources for the comparison of sequences including sequence similarity searching, alignment tools, and general comparative genomics resources.
* DNA: This category contains links to useful resources for DNA sequence analyses such as tools for comparative sequence analysis and sequence assembly. Links to programs for sequence manipulation, primer design, and sequence retrieval and submission are also listed here.
* Education: Links to information about the techniques, materials, people, places, and events of the greater bioinformatics community. Included are current news headlines, literature sources, educational material and links to bioinformatics courses and workshops.
* Expression: Links to tools for predicting the expression, alternative splicing, and regulation of a gene sequence are found here. This section also contains links to databases, methods, and analysis tools for protein expression, SAGE, EST, and microarray data.
* Human Genome: This section contains links to draft annotations of the human genome in addition to resources for sequence polymorphisms and genomics. Also included are links related to ethical discussions surrounding the study of the human genome.
* Literature: Links to resources related to published literature, including tools to search for articles and through literature abstracts. Additional text mining resources, open access resources, and literature goldmines are also listed.
* Model Organisms: Included in this category are links to resources for various model organisms ranging from mammals to microbes. These include databases and tools for genome scale analyses.
* Other Molecules: Bioinformatics tools related to molecules other than DNA, RNA, and protein. This category will include resources for the bioinformatics of small molecules as well as for other biopolymers including carbohydrates and metabolites.
* Protein: This category contains links to useful resources for protein sequence and structure analyses. Resources for phylogenetic analyses, prediction of protein features, and analyses of interactions are also found here.
* RNA: Resources include links to sequence retrieval programs, structure prediction and visualization tools, motif search programs, and information on various functional RNAs.
Data from: PROSITE
prosite.expasy.org
the-mouth.com
+7more
Updated Jun 18, 2025
Cite
(2025). PROSITE [Dataset]. https://prosite.expasy.org/
Explore at:
Dataset updated
Jun 18, 2025
Description
PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
DNA Detective: Genotype to Phenotype. A Bioinformatics Workshop for Middle...
qubeshub.org
Updated Aug 29, 2021
Cite
Anne Sternberger*; Sarah Wyatt (2021). DNA Detective: Genotype to Phenotype. A Bioinformatics Workshop for Middle School to College. [Dataset]. http://doi.org/10.24918/cs.2019.34
Explore at:
Unique identifier
https://doi.org/10.24918/cs.2019.34
Dataset updated
Aug 29, 2021
Dataset provided by
QUBES
Authors
Anne Sternberger*; Sarah Wyatt
Description
Advances in high-throughput techniques have resulted in a rising demand for scientists with basic bioinformatics skills as well as workshops and curricula that teach students bioinformatics concepts. DNA Detective is a workshop we designed to introduce students to big data and bioinformatics using CyVerse and the Dolan DNA Learning Center's online DNA Subway platform. DNA Subway is a user-friendly workspace for genome analysis and uses the metaphor of a network of subway lines to familiarize users with the steps involved in annotating and comparing DNA sequences. For DNA Detective, we use the DNA Subway Red Line to guide students through analyzing a "mystery" DNA sequence to distinguish its gene structure and name. During the workshop, students are assigned a unique Arabidopsis thaliana DNA sequence. Students "travel" the Red Line to computationally find and remove sequence repeats, use gene prediction software to identify structural elements of the sequence, search databases of known genes to determine the identity of their mystery sequence, and synthesize these results into a model of their gene. Next, students use The Arabidopsis Information Resource (TAIR) to identify their gene's function so they can hypothesize what a mutant plant lacking that gene might look like (its phenotype). Then, from a group of plants in the room, students select the plant they think is most likely defective for their gene. Through this workshop, students become acquainted with the flow of genetic information from genotype to phenotype and tackle complex genomics analyses, in the hope of inspiring and empowering them towards continued science education.
The percentage identities and similarities of CSD between CfCSP and other...
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Chuanyan Yang; Lingling Wang; Vinu S. Siva; Xiaowei Shi; Qiufen Jiang; Jingjing Wang; Huan Zhang; Linsheng Song (2023). The percentage identities and similarities of CSD between CfCSP and other CSD containing proteins. [Dataset]. http://doi.org/10.1371/journal.pone.0032012.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0032012.t002
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS ONE
Authors
Chuanyan Yang; Lingling Wang; Vinu S. Siva; Xiaowei Shi; Qiufen Jiang; Jingjing Wang; Huan Zhang; Linsheng Song
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
I%: identity, calculated as the percentage of identical amino acids per position in alignments; S%: similarity, calculated as the percentage of identical plus similar residues. I% and S% were analyzed using the Ident and Sim Analysis provided on http://www.bioinformatics.org/sms/.
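The two percentages defined above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code behind the Ident and Sim tool: the similarity groups in `SIMILAR_GROUPS` are a hypothetical choice, and skipping gapped columns is also an assumption about how positions are counted.

```python
# Hypothetical similarity groups; the tool used for the table may define them differently.
SIMILAR_GROUPS = ["GAVLI", "FYW", "CM", "ST", "KRH", "DENQ", "P"]


def identity_similarity(seq_a, seq_b, groups=SIMILAR_GROUPS):
    """Return (I%, S%) for two aligned sequences of equal length.

    I% counts identical residues; S% counts identical plus similar residues.
    Columns containing a gap ('-') are skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    group_of = {res: idx for idx, grp in enumerate(groups) for res in grp}
    compared = identical = similar = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            identical += 1
        elif x in group_of and group_of[x] == group_of.get(y):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * (identical + similar) / compared


if __name__ == "__main__":
    print(identity_similarity("ACDE", "ACEE"))
```

With these assumed groups, D and E count as similar, so the example pair differs in identity but not in similarity.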
PROSITE profiles
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Data from: Differential hippocampal gene expression is associated with...
datadryad.org
search.dataone.org
+1more
zip
Updated Oct 31, 2012
Cite
Vladimir V. Pravosudov; Timothy C. Roth II; Matthew L. Forister; Lara D. LaDage; Robin Kramer; Faye Schilkey; Alexander M. van der Linden; T. C. Roth (2012). Differential hippocampal gene expression is associated with climate-related natural variation in memory and the hippocampus in food-caching chickadees [Dataset]. http://doi.org/10.5061/dryad.dg237
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.dg237
Dataset updated
Oct 31, 2012
Dataset provided by
Dryad
Authors
Vladimir V. Pravosudov; Timothy C. Roth II; Matthew L. Forister; Lara D. LaDage; Robin Kramer; Faye Schilkey; Alexander M. van der Linden; T. C. Roth
Time period covered
2012
Area covered
Manhattan Kansas, Anchorage Alaska
Description
assembled transcript sequences: Final contigs from the assembly, minimum 100 bp. The names correspond to the specific transcript builds, where contig_id is the contig identifier, for example, avdl_parai-20111027|1234. Possibly includes UTRs. Sequences contain IUPAC ambiguity codes representing ambiguous bases (http://www.bioinformatics.org/sms/iupac.html).
predicted protein sequences: Protein products predicted by ESTScan, minimum 30 aa. Sequence identifiers for these predicted products correspond to the associated nucleotide sequence in file assembled transcript sequences.txt, and are provided suffixes _0, _1, etc., to accommodate multiple predictions.
Data from: Highly significant improvement of protein sequence alignments...
zenodo.org
data.niaid.nih.gov
application/gzip, png +1
Updated Jul 16, 2024
Cite
Athanasios Baltzis; Leila Mansouri; Suzanne Jin; Björn E. Langer; Ionas Erb; Cedric Notredame (2024). Highly significant improvement of protein sequence alignments with AlphaFold2 [Dataset]. http://doi.org/10.5281/zenodo.7031286
Explore at:
Available download formats: tsv, png, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.7031286
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Athanasios Baltzis; Leila Mansouri; Suzanne Jin; Björn E. Langer; Ionas Erb; Cedric Notredame
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data, figures and tables from the manuscript "Highly significant improvement of protein sequence alignments with AlphaFold2" (https://doi.org/10.1093/bioinformatics/btac625).

The repository containing all the steps to replicate the analysis is available at GitHub (https://github.com/cbcrg/msa-af2-nf).

*The authors Athanasios Baltzis and Leila Mansouri contributed equally.
Genomics England - Bioinformatics
healthdatagateway.org
unknown
Updated Mar 30, 2023
Cite
The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy. (2023). Genomics England - Bioinformatics [Dataset]. https://healthdatagateway.org/dataset/381
Explore at:
Available download formats: unknown
Dataset updated
Mar 30, 2023
Dataset provided by
Genomics England
Authors
The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy.
License
https://www.genomicsengland.co.uk/about-gecip/joining-research-community/
Description
To identify and enrol participants for the 100,000 Genomes Project we have created NHS Genomic Medicine Centres (GMCs). Each centre includes several NHS Trusts and hospitals. GMCs recruit and consent patients. They then provide DNA samples and clinical information for analysis.

Illumina, a biotechnology company, have been commissioned to sequence the DNA of participants. They return the whole genome sequences to Genomics England. We have created a secure, monitored, infrastructure to store the genome sequences and clinical data. The data is analysed within this infrastructure and any important findings, like a diagnosis, are passed back to the patient’s doctor.

To help make sure that the project brings benefits for people who take part, we have created the Genomics England Clinical Interpretation Partnership (GeCIP). GeCIP brings together funders, researchers, NHS teams and trainees. They will analyse the data – to help ensure benefits for patients and an increased understanding of genomics. The data will also be used for medical and scientific research. This could be research into diagnosing, understanding or treating disease.

To learn more about how we work you can read the 100,000 Genomes Project protocol. It has details of the development, delivery and operation of the project. It also sets out the patient and clinical benefit, scientific and transformational objectives, the implementation strategy and the ethical and governance frameworks.
INSDC Environment Sample Sequences
gbif.org
researchdata.edu.au
Updated Jul 12, 2025
+ more versions
Cite
European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
Explore at:
Unique identifier
https://doi.org/10.15468/mcmd5g
Dataset updated
Jul 12, 2025
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
European Bioinformatics Institute (http://www.ebi.ac.uk/)
Authors
European Bioinformatics Institute (EMBL-EBI)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
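The grouping filter described in step 6 can be illustrated with a short Python sketch. It is not taken from the embl-adapter code base: records are assumed to be plain dictionaries keyed by the field names listed in that step, and `drop_large_groups` is a hypothetical helper name.

```python
from collections import defaultdict

# Field names as listed in step 6; a missing sample_accession is treated as None.
GROUP_FIELDS = (
    "scientific_name",
    "collection_date",
    "location",
    "country",
    "identified_by",
    "collected_by",
    "sample_accession",
)
MAX_GROUP_SIZE = 50  # groups with more than 50 records are excluded


def drop_large_groups(records, fields=GROUP_FIELDS, max_size=MAX_GROUP_SIZE):
    """Group records by the given fields and drop every group larger than max_size."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in fields)
        groups[key].append(rec)
    kept = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept


if __name__ == "__main__":
    likely_duplicates = [{"scientific_name": "Homo habilis", "country": "KE"}] * 51
    distinct = [{"scientific_name": "Parus major", "country": "SE"}] * 2
    print(len(drop_large_groups(likely_duplicates + distinct)))
```

A group of exactly 50 records is kept in this sketch, since the description excludes only groups with more than 50.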
Data from: Ensembl TSS dataset for GRCh38
zenodo.org
portalcienciaytecnologia.jcyl.es
+2more
bin
Updated Aug 26, 2024
Cite
José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7147597
Dataset updated
Aug 26, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough: the specific TSS position of each transcript is needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.

First, we need an interface in order to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks will form our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, 700 bases before this codon and 300 bases after it are concatenated, getting the final sequence of 1003 nucleotides that is going to be used in our models. These specific window values have been used in (Bhandari et al., 2021) and we have kept them as we find them interesting for comparison purposes.

One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. In order to get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected the specific position, we get 700 bases before and 300 bases after it as we did with the positive instances.
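The window arithmetic (700 + 3 + 300 = 1003 nucleotides) can be made concrete with a small Python sketch. It is an illustration only and is not taken from the authors' scripts: it assumes a forward-strand sequence held in a string, a 0-based position, and that windows running off either end of the sequence are discarded. The function name `extract_window` is hypothetical.

```python
UPSTREAM = 700    # bases taken before the 3-base "codon"
CODON = 3         # the position itself plus the 2 following bases
DOWNSTREAM = 300  # bases taken after the codon


def extract_window(sequence, position):
    """Return the 1003-nt window around a 0-based position, or None if it does not fit.

    Assumes forward-strand coordinates; reverse-strand transcripts would need
    reverse-complement handling that this sketch leaves out.
    """
    start = position - UPSTREAM
    end = position + CODON + DOWNSTREAM
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


if __name__ == "__main__":
    toy_sequence = "ACGT" * 500  # 2000 nt of toy data
    window = extract_window(toy_sequence, 700)
    print(len(window))
```

The same function applies to positive instances (a TSS position) and to negative instances (a random non-TSS position inside a transcript).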

Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
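As a rough Python sketch of the two steps in this paragraph (negative sampling at a 10:1 ratio and the split by chromosome), under assumptions that do not come from the authors' repository: chromosome names are plain strings such as "16", positions are 0-based integers, every instance from a test chromosome goes to the test set, and both function names are hypothetical.

```python
import random

VALIDATION_CHROMS = {"16"}
TEST_CHROMS = {"1", "3", "13", "19", "21"}
NEGATIVES_PER_POSITIVE = 10


def assign_split(chromosome):
    """Map a chromosome name to 'train', 'validation' or 'test'."""
    if chromosome in VALIDATION_CHROMS:
        return "validation"
    if chromosome in TEST_CHROMS:
        return "test"
    return "train"


def sample_negative_positions(transcript_start, transcript_end, tss_position,
                              n=NEGATIVES_PER_POSITIVE, rng=None):
    """Pick up to n distinct positions in [transcript_start, transcript_end) that are not the TSS."""
    rng = rng or random.Random()
    candidates = [p for p in range(transcript_start, transcript_end) if p != tss_position]
    return rng.sample(candidates, min(n, len(candidates)))


if __name__ == "__main__":
    negatives = sample_negative_positions(1000, 5000, 1000, rng=random.Random(0))
    print(assign_split("16"), assign_split("X"), len(negatives))
```

Splitting by chromosome rather than by random rows keeps overlapping windows from the same locus out of different partitions.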
Protein Secondary Structure
kaggle.com
zip
Updated Jun 6, 2018
Cite
-_- (2018). Protein Secondary Structure [Dataset]. https://www.kaggle.com/alfrandom/protein-secondary-structure
Explore at:
Available download formats: zip (40687706 bytes)
Dataset updated
Jun 6, 2018
Authors
-_-
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Introduction

Protein secondary structure can be calculated based on its atoms' 3D coordinates once the protein's 3D structure is solved using X-ray crystallography or NMR. Commonly, DSSP is the tool used for calculating the secondary structure and assigns one of the following secondary structure types (https://swift.cmbi.umcn.nl/gv/dssp/index.html) to every amino acid in a protein:

C: Loops and irregular elements (corresponding to the blank characters output by DSSP)

E: β-strand

H: α-helix

B: β-bridge

G: 3-helix

I: π-helix

T: Turn

S: Bend

However, X-ray crystallography and NMR are expensive. Ideally, we would like to predict the secondary structure of a protein directly from its primary sequence, a problem with a long history. A review on this topic was published recently: "Sixty-five years of the long march in protein secondary structure prediction: the final stretch?".

For the purpose of secondary structure prediction, it is common to simplify the aforementioned eight states (Q8) into three (Q3) by merging (E, B) into E, (H, G, I) into H, and (C, S, T) into C. The current accuracy for three-state (Q3) secondary structure prediction is about 85%, while that for eight-state (Q8) prediction is <70%. The exact number depends on the particular test dataset used.
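The Q8-to-Q3 reduction is a fixed per-character mapping, with strands (E, B) mapped to E, helices (H, G, I) mapped to H, and the remaining states (C, S, T) mapped to C. A minimal Python sketch follows; the function name `q8_to_q3` is only illustrative and is not part of the dataset's tooling.

```python
# Eight-state (Q8) to three-state (Q3) mapping.
Q8_TO_Q3 = {
    "E": "E", "B": "E",            # strand-like states
    "H": "H", "G": "H", "I": "H",  # helical states
    "C": "C", "S": "C", "T": "C",  # coil, bend, turn
}


def q8_to_q3(sst8):
    """Convert an eight-state secondary structure string to three states."""
    return "".join(Q8_TO_Q3[state] for state in sst8)


if __name__ == "__main__":
    print(q8_to_q3("CHHGGTTEEBSI"))
```

An unknown state character raises a `KeyError`, which surfaces malformed input instead of silently passing it through.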

Dataset

The main dataset lists peptide sequences and their corresponding secondary structures. It is a transformation of https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz, downloaded on 2018-06-06 from RCSB PDB, into a tabular structure. If you download the file at a later time, the number of sequences in it will probably increase.

Description of columns:

pdb_id: the id used to locate its entry on https://www.rcsb.org/

chain_code: when a protein consists of multiple peptides (chains), the chain code is needed to locate a particular one.

seq: the sequence of the peptide

sst8: the eight-state (Q8) secondary structure

sst3: the three-state (Q3) secondary structure

len: the length of the peptide

has_nonstd_aa: whether the peptide contains nonstandard amino acids (B, O, U, X, or Z).

Key steps in the transformation:

Both Q3 and Q8 secondary structure sequences are listed.

All nonstandard amino acids, which include B, O, U, X, and Z, are masked with the "*" character.

An additional column (has_nonstd_aa) is added to indicate whether the protein sequence contains nonstandard amino acids.

A subset of the sequences with low sequence identity and high resolution, ready for training, is also provided.
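The masking step and the has_nonstd_aa flag from the list above can be sketched in a few lines of Python. This is an illustration of the described transformation, not the curation code from the linked repository, and `mask_nonstandard` is a hypothetical name.

```python
NONSTANDARD = set("BOUXZ")  # nonstandard amino acid codes named in the description


def mask_nonstandard(seq):
    """Return (masked sequence, has_nonstd_aa) with nonstandard residues replaced by '*'."""
    has_nonstd_aa = any(aa in NONSTANDARD for aa in seq)
    masked = "".join("*" if aa in NONSTANDARD else aa for aa in seq)
    return masked, has_nonstd_aa


if __name__ == "__main__":
    print(mask_nonstandard("MKUXA"))
```

Masking keeps the sequence length unchanged, so the masked sequence stays aligned position by position with the sst8 and sst3 strings.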

For details of curation, please see https://github.com/zyxue/pdb-secondary-structure.

A subset (9079 sequences) based on sequences culled by PISCES with more strict quality control is also provided. This dataset is considered ready for training models.

The culled subset generated on 2018-05-31 with cutoffs of 25%, 2Å, and 0.25 for sequence identity, resolution and R-factor respectively, is used. The URL to the original culled list is http://dunbrack.fccc.edu/Guoli/culledpdb_hh/cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz, but it may not be permanently available. This dataset contains more columns from cullpdb_pc25_res2.0_R0.25_d180531_chains9099.gz with self-explanatory names.

For more about PISCES, please see https://academic.oup.com/bioinformatics/article/19/12/1589/258419.

Acknowledgements

The peptide sequence and secondary structure are downloaded from https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz. The culled subset is downloaded from http://dunbrack.fccc.edu/PISCES.php.

Inspiration

Kaggle provides a great platform for sharing ideas and solving data science problems. Sharing a cleaned dataset helps prevent duplicated work and also provides a common dataset for more comparable benchmarks among different methods.

Early attempts on this (or related) problem:

Baldi, Pierre, Søren Brunak, Paolo Frasconi, Gianluca Pollastri and Giovanni Soda. “Bidirectional Dynamics for Protein Secondary Structure Prediction.” Sequence Learning (2001). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf

Chen, J. and Chaudhari, N. S.. "Protein Secondary Structure Prediction with bidirectional LSTM networks." Paper presented at the meeting of the Post-Conference Workshop on Computational Intelligence Approaches for the Analysis of Bio-data (CI-BIO), Montreal, Canada, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.7092&rep=rep1&type=pdf (Couldn't find a pdf)

Sepp Hochreiter, Martin Heusel, Klaus Obermayer; Fast model-based protein homology detection without alignment, Bioinformatics, Volume 23, Issue 14, 15 July 2007, Pages 1728–1736, https://doi.org/10.1093/bioinformatics/btm247
WORKSHOP: Introduction to Metabarcoding using QIIME2
explore.openaire.eu
Updated Feb 22, 2022
Cite
Ashley Dungan; Gayle Philip; Andrew Perry; Rania Ismail; Laura Geissler; Kshitij Tandon; Igor Makunin (2022). WORKSHOP: Introduction to Metabarcoding using QIIME2 [Dataset]. http://doi.org/10.5281/zenodo.6350807
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6350807
Dataset updated
Feb 22, 2022
Authors
Ashley Dungan; Gayle Philip; Andrew Perry; Rania Ismail; Laura Geissler; Kshitij Tandon; Igor Makunin
Description
This record includes training materials associated with the Australian BioCommons workshop "Introduction to Metabarcoding using QIIME2". This workshop took place on 22 February 2022.

Event description

Metabarcoding has revolutionised the study of biodiversity science. By combining DNA taxonomy with high-throughput DNA sequencing, it offers the potential to observe a larger diversity in the taxa within a single sample, rapidly expanding the scope of microbial analysis and generating high-quality biodiversity data. This workshop will introduce the topic of metabarcoding and how you can use QIIME2 to analyse 16S data and gain simultaneous identification of all taxa within a sample. QIIME2 is a popular tool used to perform powerful microbiome analysis that can transform your raw data into publication quality visuals and statistics. In this workshop, using example 16S data from the shallow-water marine anemone E. diaphana, you will learn how to use this pipeline to run essential steps in microbial analysis including generating taxonomic assignments and phylogenetic trees, and performing both alpha- and beta-diversity analysis.

Materials are shared under a Creative Commons Attribution 4.0 International agreement unless otherwise specified and were current at the time of the event.

Files and materials included in this record:

Event metadata (PDF): Information about the event including description, event URL, learning objectives, prerequisites, technical requirements etc.

Index of training materials (PDF): List and description of all materials associated with this event including the name, format, location and a brief description of each file.

Schedule (PDF): A breakdown of the topics and timings for the workshop.

Materials shared elsewhere:

This workshop follows the tutorial "Introduction to metabarcoding with QIIME2", which has been made publicly available by Melbourne Bioinformatics. https://www.melbournebioinformatics.org.au/tutorials/tutorials/qiime2/qiime2/
Alternative Splicing Annotation Project II Database
rrid.site
neuinfo.org
+3more
Updated Jun 26, 2025
Cite
(2025). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_000322
Dataset updated
Jun 26, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis and splicing variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and lists of tissue and cancer (normal) specific alternatively spliced genes were recalculated and updated. The authors have created novel orthologous exon and intron databases, with their splice variants, based on multiple alignment among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein similarity based methods. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rate for skipped exons are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
Bioinformatics: An Interactive Introduction to NCBI
qubeshub.org
Updated Jan 3, 2019
Cite
Seth Bordenstein (2019). Bioinformatics: An Interactive Introduction to NCBI [Dataset]. http://doi.org/10.25334/Q4915C
Explore at:
Unique identifier
https://doi.org/10.25334/Q4915C
Dataset updated
Jan 3, 2019
Dataset provided by
QUBES
Authors
Seth Bordenstein
Description
Modules showing how the NCBI database classifies and organizes information on DNA sequences, evolutionary relationships, and scientific publications, plus a module on identifying a nucleotide sequence from an insect endosymbiont using BLAST.
RBG structural bioinformatics general files
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Levine, Tim P (2023). RBG structural bioinformatics general files [Dataset]. http://doi.org/10.7910/DVN/VSD9SS
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/VSD9SS
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Levine, Tim P
Description
Many different files for the project, including:
• PDB files with CONSURF conservation scores in the tempFactor field (low numbers = high conservation)
• protein sequences (2 folders, all formatted for DNA Strider 1.5a12)
• DALI results for domains predicted by ColabFold
• trees: phylogeny work in iTOL and Phyml (not included in paper)
Bayesian NrdA phylogeny
su.figshare.com
researchdata.se
txt
Updated May 31, 2023
Cite
Daniel Lundin (2023). Bayesian NrdA phylogeny [Dataset]. http://doi.org/10.17045/sthlmuni.11558187.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.17045/sthlmuni.11558187.v1
Dataset updated
May 31, 2023
Dataset provided by
Stockholm University
Authors
Daniel Lundin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Bayesian phylogeny of NrdA, the catalytic component of class I ribonucleotide reductase. Sequences from NCBI's RefSeq and GenBank databases (Haft et al. 2018; https://doi.org/10.1093/nar/gkx1068), downloaded in March 2019, were searched with subclass-specific HMMER (Eddy 2011; https://doi.org/10.1371/journal.pcbi.1002195) profiles for NrdA and for NrdJ (class II RNR), which serves as the outgroup (Lundin et al. in preparation). The resulting sequences were clustered at 60% identity with UCLUST (Edgar 2010; https://doi.org/10.1093/bioinformatics/btq461) to create a representative set of sequences. After manual inspection of sequences, 342 out of 27821 original NrdA sequences remained, plus 26 NrdJ sequences selected for aligning well to NrdA. The sequences were aligned with ProbCons (Do et al. 2005; https://doi.org/10.1101/gr.2821705) and 283 reliably aligned positions were selected with BMGE (Criscuolo & Gribaldo 2010; https://doi.org/10.1186/1471-2148-10-210) using the BLOSUM30 matrix. The alignment file is NrdA_uc0.60.NrdJ_uc0.30_outgroup.intr.correct.nolb.co.profile.BLOSUM30.bmge.mb.nxs. A Bayesian phylogeny was estimated with MrBayes v. 3.2.6 (Ronquist & Huelsenbeck 2003; https://doi.org/10.1093/bioinformatics/btg180; https://github.com/NBISweden/MrBayes) using a gamma distribution for rate variation and rjMCMC to jump between amino acid models. MrBayes was run with four chains and five runs until the average standard deviation of split frequencies reached 0.015. (See NrdA_uc0.60.NrdJ_uc0.30_outgroup.intr.correct.nolb.co.profile.BLOSUM30.bmge.mb.) The phylogeny, in Dendroscope (Huson et al. 2007; https://doi.org/10.1186/1471-2105-8-460) nexml format, is NrdA_uc0.60.NrdJ_uc0.30_outgroup.intr.correct.nolb.co.profile.BLOSUM30.bmge.mb.con.fullname.nexml.
k-Word matches: an alignment-free sequence comparison method
researchdata.edu.au
Updated May 5, 2022
Cite
Conrad J. Burden; Sylvain Forêt; Susan R. Wilson (2022). k-Word matches: an alignment-free sequence comparison method [Dataset]. http://doi.org/10.4225/03/5a1372cde0ad8
Explore at:
Unique identifier
https://doi.org/10.4225/03/5a1372cde0ad8
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Conrad J. Burden; Sylvain Forêt; Susan R. Wilson
Description
k-word matches, the number of words of length k shared between two sequences (also known as the D2 statistic), are used as an alignment-free sequence comparison statistic. The advantages of this statistic over alignment-based methods for nucleotide and amino-acid sequence comparisons are, first, that it does not assume that homologous segments are contiguous and, second, that the algorithm is computationally extremely fast, with runtime proportional to the size of the sequence under scrutiny. We summarise our results to date on determining the distributional properties of the D2 statistic for a range of biologically relevant parameters and outline the directions in which the research will proceed. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
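As an illustration of the statistic described above, here is a minimal Python sketch that counts k-word matches between two sequences by multiplying per-word counts. The function names `kword_counts` and `d2` are hypothetical and are not taken from the dataset or the proceedings. The sketch uses exact word matches only and does not cover the distributional results the authors summarise.

```python
from collections import Counter


def kword_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Number of k-word matches between seq_a and seq_b.

    Every pair of positions (i in seq_a, j in seq_b) whose k-words are
    identical contributes one match, which equals the sum over words of
    count_in_a * count_in_b.
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Counting words takes one pass over each sequence, which is consistent with the linear runtime mentioned in the description.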

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
INSDC Host Organism Sequences
gbif.org
researchdata.edu.au
Updated Jul 12, 2025
Cite
European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Host Organism Sequences [Dataset]. http://doi.org/10.15468/e97kmy
Explore at:
Unique identifier
https://doi.org/10.15468/e97kmy
Dataset updated
Jul 12, 2025
Dataset provided by
Global Biodiversity Information Facility https://www.gbif.org/
European Bioinformatics Institute http://www.ebi.ac.uk/
Authors
European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically from the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by the method described below.
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data retrieved from the API was processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or groups of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records with missing information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. Records associated with the same vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
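The completeness filter (step 4) and the duplicate-group exclusion (step 6) can be sketched in Python as follows. This is an illustrative sketch only and is not the embl-adapter implementation. Records are assumed to be plain dictionaries. The grouping field names come from the description above, while `specimen_voucher` and all function names are hypothetical. Voucher aggregation (step 5) and the other steps are omitted.

```python
from collections import defaultdict

# Grouping fields named in step 6 of the description.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)
MAX_GROUP_SIZE = 50  # groups with more than 50 records are excluded


def has_required_fields(record: dict) -> bool:
    """Step 4: keep records with a voucher, or with both location and date.

    The "specimen_voucher" field name is an assumption for illustration.
    """
    if record.get("specimen_voucher"):
        return True
    return bool(record.get("location")) and bool(record.get("collection_date"))


def drop_oversized_groups(records: list, max_size: int = MAX_GROUP_SIZE) -> list:
    """Step 6: group records by GROUP_KEYS and drop groups above max_size."""
    groups = defaultdict(list)
    for record in records:
        # Missing fields become None, so "when available" keys still group.
        key = tuple(record.get(name) for name in GROUP_KEYS)
        groups[key].append(record)
    kept = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept


def filter_records(records: list) -> list:
    """Apply the step 4 filter, then the step 6 group-size exclusion."""
    complete = [r for r in records if has_required_fields(r)]
    return drop_oversized_groups(complete)
```

A group of exactly 50 identical records is kept under this reading of "more than 50", and a group of 51 is dropped as a whole.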
Whole genome DNA sequences of Gulf of Mexico invertebrates
search.dataone.org
data.griidc.org
Updated Feb 5, 2025
Cite
Thomas, W. Kelley (2025). Whole genome DNA sequences of Gulf of Mexico invertebrates [Dataset]. http://doi.org/10.7266/n7-pchj-dh15
Explore at:
Unique identifier
https://doi.org/10.7266/n7-pchj-dh15
Dataset updated
Feb 5, 2025
Dataset provided by
GRIIDC
Authors
Thomas, W. Kelley
Area covered
Gulf of Mexico (Gulf of America)
Description
The dataset consists of whole genome DNA sequences generated from invertebrate species from the Gulf of Mexico during the Benthic Invertebrate Taxonomy, Metagenomics, and Bioinformatics Workshop (BITMaB) in 2017 in Corpus Christi, Texas, USA. All genomic data sets were deposited in and distributed by GenBank (NCBI), the European Nucleotide Archive (ENA) - European Bioinformatics Institute (EMBL-EBI), DNA Data Bank of Japan, NemATOL, the Global Genome Initiative, and Ocean Genome Legacy.
Raw motif mapping bedfile data and model training set class probabilities
search.dataone.org
datadryad.org
Updated May 6, 2025
Cite
Phillip Davis (2025). Raw motif mapping bedfile data and model training set class probabilities [Dataset]. http://doi.org/10.5061/dryad.tdz08kq3w
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.tdz08kq3w
Dataset updated
May 6, 2025
Dataset provided by
Dryad Digital Repository
Authors
Phillip Davis
Time period covered
Jan 1, 2023
Description
Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...


Bioinformatics Links Directory

RRID:SCR_008018, nif-0000-10170, Bioinformatics Links Directory (RRID:SCR_008018), Bioinformatics Links Directory, Canadian Bioinformatics.ca Links Directory, Bioinformatics.ca Links Directory

Explore at:

189 scholarly articles cite this dataset (View in Google Scholar)



