100+ datasets found
  1. Bioinformatics repository examples with good practices of using GitHub.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Cite
    Yasset Perez-Riverol; Laurent Gatto; Rui Wang; Timo Sachsenberg; Julian Uszkoreit; Felipe da Veiga Leprevost; Christian Fufezan; Tobias Ternent; Stephen J. Eglen; Daniel S. Katz; Tom J. Pollard; Alexander Konovalov; Robert M. Flight; Kai Blin; Juan Antonio Vizcaíno (2023). Bioinformatics repository examples with good practices of using GitHub. [Dataset]. http://doi.org/10.1371/journal.pcbi.1004947.t001
    Dataset provided by
    PLOS (http://plos.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/.

  2. ZBIT Bioinformatics Toolbox: A Web-Platform for Systems Biology and...

    • plos.figshare.com
    jpeg
    Updated Jun 6, 2023
    Cite
    Michael Römer; Johannes Eichner; Andreas Dräger; Clemens Wrzodek; Finja Wrzodek; Andreas Zell (2023). ZBIT Bioinformatics Toolbox: A Web-Platform for Systems Biology and Expression Data Analysis [Dataset]. http://doi.org/10.1371/journal.pone.0149263
    Dataset provided by
    PLOS (http://plos.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bioinformatics analysis has become an integral part of research in biology. However, installation and use of scientific software can be difficult and often requires technical expert knowledge. Reasons are dependencies on certain operating systems or required third-party libraries, missing graphical user interfaces and documentation, or nonstandard input and output formats. In order to make bioinformatics software easily accessible to researchers, we here present a web-based platform. The Center for Bioinformatics Tuebingen (ZBIT) Bioinformatics Toolbox provides web-based access to a collection of bioinformatics tools developed for systems biology, protein sequence annotation, and expression data analysis. Currently, the collection encompasses software for conversion and processing of community standards SBML and BioPAX, transcription factor analysis, and analysis of microarray data from transcriptomics and proteomics studies. All tools are hosted on a customized Galaxy instance and run on a dedicated computation cluster. Users only need a web browser and an active internet connection in order to benefit from this service. The web platform is designed to facilitate the usage of the bioinformatics tools for researchers without advanced technical background. Users can combine tools for complex analyses or use predefined, customizable workflows. All results are stored persistently and reproducible. For each tool, we provide documentation, tutorials, and example data to maximize usability. The ZBIT Bioinformatics Toolbox is freely available at https://webservices.cs.uni-tuebingen.de/.

  3. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats
    zip (12928905 bytes)
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
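The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's script: it uses only the standard library and computes just one property (mean Kyte-Doolittle hydrophobicity) by hand, whereas the dataset description says the properties were calculated with Biopython. The function names and the `PROT_00000` identifier format are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def random_protein(rng, min_len=50, max_len=300):
    """Step 1: draw a random amino acid chain of 50 to 300 residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (one property only): average Kyte-Doolittle value."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, index):
    """Build one row; the class label is random (step 3)."""
    sequence = random_protein(rng)
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [make_record(rng, i) for i in range(5)]
```

Because the labels are drawn independently of the sequences, a classifier trained on such data should not be expected to beat chance, which is consistent with the limitations listed for this dataset.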

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  4. MINUTE-ChIP example data

    • figshare.scilifelab.se
    txt
    Updated Jan 15, 2025
    Cite
    Carmen Navarro Luzon; Simon Elsässer (2025). MINUTE-ChIP example data [Dataset]. http://doi.org/10.17044/scilifelab.25348405.v1
    Dataset provided by
    Karolinska Institutet
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This collection contains an example MINUTE-ChIP dataset on which to run the minute pipeline. It is provided as supporting material to help users understand the results of a MINUTE-ChIP experiment, from raw data to a primary analysis that yields the relevant files for downstream analysis along with summarized QC indicators. The example primary non-demultiplexed FASTQ files provided here were processed with the minute pipeline to generate GSM5493452-GSM5493463 (H3K27me3) and GSM5823907-GSM5823918 (Input), all deposited on GEO under series GSE181241. For more information about MINUTE-ChIP, see the publication relevant to this dataset: Kumar, Banushree, et al. "Polycomb repressive complex 2 shields naïve human pluripotent cells from trophectoderm differentiation." Nature Cell Biology 24.6 (2022): 845-857. For more information about the minute pipeline, there is a public bioRxiv preprint, a GitHub repository, and official documentation.

  5. Example datasets for AlphaCRV

    • zenodo.org
    application/gzip, bin
    Updated Jan 8, 2024
    Cite
    Francisco J. Guzmán-Vega (2024). Example datasets for AlphaCRV [Dataset]. http://doi.org/10.5281/zenodo.10470744
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Time period covered
    Jan 8, 2024
    Description

    Example datasets for AlphaCRV: A Pipeline for Identifying Accurate Binder Topologies in Mass-Modeling with AlphaFold. With these datasets you can replicate two of the three examples shown in the paper, following along with the Jupyter notebooks in the GitHub repository at https://github.com/strubelab/AlphaCRV. Description:

    • AVRPia.fasta, AVRPik.fasta, SKP1.fasta: Sequences of the bait molecules for the three examples.
    • AVRPia_binders.fasta, AVRPik_binders.fasta, SKP1_binders.fasta: Sequences of the binder molecules for the three examples.
    • AVRPia_vs_rice_models.tar.lzma, AVRPik_vs_rice_models.tar.lzma: Compressed archives of the AlphaFold-Multimer models for the AVRPia and AVRPik examples. The 712 complexes for SKP1 are more than 100GB in size, so those can be provided upon request. To decompress the .tar.lzma archives use the following two commands on each:
    unlzma AVRPia_vs_rice_models.tar.lzma
    tar -xvf AVRPia_vs_rice_models.tar
    • AVRPia_vs_rice_clusters.tar.gz, AVRPik_vs_rice_clusters.tar.gz, SKP1_vs_rice_clusters.tar.gz: Compressed archives with the results from running AlphaCRV on the three examples presented in the paper. To decompress these archives use the following command on each:
    tar -xvzf AVRPia_vs_rice_clusters.tar.gz
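
For readers who prefer to unpack the archives from Python (for example inside the Jupyter notebooks), the shell commands above have a rough standard-library equivalent. This is a sketch, not part of AlphaCRV: `extract_archive` is a hypothetical helper name, and it assumes Python's `tarfile` and `lzma` modules are available so that `"r:*"` can detect both the `.tar.gz` and the `.tar.lzma` archives.

```python
import tarfile
from pathlib import Path


def extract_archive(archive, dest):
    """Unpack a .tar.gz or .tar.lzma archive into dest and list its top-level entries."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(archive, "r:*") as tar:
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return sorted(p.name for p in dest.iterdir())


# Example call (paths are illustrative):
# extract_archive("AVRPia_vs_rice_clusters.tar.gz", "AVRPia_vs_rice_clusters")
```
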

  6. Data from: Advancing computational biology and bioinformatics research...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Sep 27, 2019
    Cite
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
    Description

    Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.

  7. temporary examples

    • figshare.com
    xlsx
    Updated Dec 15, 2018
    Cite
    Ryan Dale (2018). temporary examples [Dataset]. http://doi.org/10.6084/m9.figshare.7470083.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example files to test URL handling

  8. Making toast: Using analogies to explore concepts in bioinformatics

    • qubeshub.org
    Updated Aug 26, 2021
    Cite
    Kate Hertweck (2021). Making toast: Using analogies to explore concepts in bioinformatics [Dataset]. http://doi.org/10.24918/cs.2016.11
    Dataset provided by
    QUBES
    Description

    Contemporary biology is moving towards heavy reliance on computational methods to manage, find patterns, and derive meaning from large-scale data, such as genomic sequences. Biology teachers are increasingly compelled to prepare students with skills to meet these challenges. However, introducing biology students to more abstract concepts associated with computational thinking remains a major challenge. Analogies have long been used in science classrooms to help students comprehend complex concepts by relating them to familiar processes. Here I present a multi-step procedure for introducing students to large-scale data analysis (bioinformatics workflows) by asking them to describe a common daily task: making toast. First, students describe the main steps associated with this procedure. Next, students are presented with alternative scenarios for materials and equipment and are asked to extend the analogy to accommodate them. Finally, students are led through examples of how the analogy breaks down, or fails to accurately represent, a bioinformatics analysis. This structured approach to student exploration of analogies related to computational biology capitalizes on diverse student experiences to both clarify concepts and ameliorate possible misconceptions. Similar methods can be used to introduce many abstract concepts in both biology and computer science.

  9. Data from: Spectrum analysis based method for dynamics and collective...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Yi-Zhen Shen; Yong-Sheng Ding; Quan Gu (2022). Spectrum analysis based method for dynamics and collective analysis of protein-protein interaction networks [Dataset]. http://doi.org/10.4225/03/5a13725619374
    Dataset provided by
    Monash University
    Description

    The importance of understanding biological interaction networks has fueled the development of numerous interaction data generation techniques, databases and prediction tools. Generating high-confidence interaction networks is the first step towards the study of protein–protein interactions (PPI). A number of experimental methods based on distinct physical principles, such as the yeast two-hybrid method (Y2H), have been developed to identify PPI. In this work, we focus on one example of biological networks, namely the yeast protein interaction network (YPIN). For YPIN, we design and implement a computational model that captures the discrete and stochastic nature of protein interactions. In this model, we apply a spectrum analysis method to the variance of the protein nodes that play an important role in the PPI networks, which can show the topology structure of the dynamic and collective performance of PPI networks. We take YPIN, such as 48 "quasi-cliques" and 6 "quasi-bipartites" separated from 11855 yeast PPI networks with 2617 proteins, as an example and apply spectrum analysis to show the topology structure of dynamic and collective analysis of PPI networks and the performances. The obtained results may be valuable for deciphering unknown protein functions, determining protein complexes, and inventing drugs. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  10. Examples datasets for Microbiology

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Tristan Cordier (2020). Examples datasets for Microbiology [Dataset]. http://doi.org/10.5281/zenodo.2605445
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three example datasets for performing bioinformatics analysis.

  11. Example File 1.txt

    • figshare.com
    txt
    Updated Apr 27, 2020
    Cite
    Vafaee Lab (2020). Example File 1.txt [Dataset]. http://doi.org/10.6084/m9.figshare.12200138.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example files to run DrugSimDB interface

  12. Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 11, 2014
    Cite
    Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng (2014). “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001209841
    Description

    “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education

  13. Bakta Annotation Examples

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Nov 10, 2021
    Cite
    Oliver Schwengers (2021). Bakta Annotation Examples [Dataset]. http://doi.org/10.5281/zenodo.4922840
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data repository provides exemplary bacterial genome annotations conducted with Bakta of a broad taxonomical range of genomes comprising many pathogens (all ESKAPE), commensals and environmental species.

    Bakta is a tool for the rapid & standardized local annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis: https://github.com/oschwengers/bakta

  14. Principles and steps for integrating bioinformatics

    • data.mendeley.com
    Updated Aug 7, 2024
    Cite
    Hang Thi Nguyen (2024). Principles and steps for integrating bioinformatics [Dataset]. http://doi.org/10.17632/wjx5h7wh22.3
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Biological data is increasing at a high speed, creating a vast amount of knowledge, while updating knowledge in teaching is limited, along with the unchanged time in the classroom. Therefore, integrating bioinformatics into teaching will be effective in teaching biology today. However, the big challenge is that pedagogical university students have yet to learn the basic knowledge and skills of bioinformatics, so they have difficulty and confusion when using it in biology teaching. This dataset includes survey results on high school teachers, teacher training curriculums and pedagogical students in Vietnam. The highlights of this dataset are six basic principles and four steps of bioinformatics integration in teaching biology at high schools, with illustrative examples. The principles and approaches of integrating Bioinformatics into biology teaching improve the quality of biology teaching and promote STEM education in Vietnam and developing countries.

  15. Prediction of multi-drug resistance transporters dataset

    • search.datacite.org
    • f1000.figshare.com
    Updated Jun 5, 2015
    Cite
    Jason E. McDermott; Paul Bruillard; Christopher C. Overall; Luke Gosink; Stephen R. Lindemann (2015). Prediction of multi-drug resistance transporters dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1415804
    Dataset provided by
    DataCite (https://www.datacite.org/)
    f1000research.com
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data file 1
    Title: Data File PROSITE_positives_PS000125.fasta.
    Legend: Sequence file in FASTA format of all positive examples for the ser/thr phosphatase model.

    Data file 2
    Data File PROSITE_negatives_PS000125.fasta.
    Sequence file in FASTA format of all randomly selected negative examples for the ser/thr phosphatase model.

    Data file 3
    Data File PROSITE_positives_PS00028.fasta.
    Sequence file in FASTA format of all positive examples for the zinc finger model.

    Data file 4
    Data File PROSITE_negatives_PS00028.fasta.
    Sequence file in FASTA format of all randomly selected negative examples for the zinc finger model.

    Data file 5
    Data File PROSITE_PS00125.txt.
    PROSITE record used for the ser/thr phosphatase model.

    Data file 6
    Data File PROSITE_PS00028.txt.
    PROSITE record used for the zinc finger model.

    Data file 7
    Data File MDR_TCDB_positives.fasta.
    Sequence file of MDR transporters used for training. FASTA format file of positive examples used in this study derived from the TCDB.

    Data file 8
    Data File MDR_TCDB_negatives.fasta.
    Sequence file of non-MDR transporters used for training. FASTA format file of negative examples used in this study derived from the TCDB.

    Data file 9
    Data File PILGram_PATTERNS_PS00125.txt.
    Regular expression generated by PILGram for the ser/thr phosphatase model.

    Data file 10
    Data File PS00125_alignments.out.
    Sequence alignments of PILGram model matches to the positive examples in the ser/thr phosphatase model.

    Data file 11
    Data File PILGram_PATTERNS_PS00028.txt.
    Regular expressions generated by PILGram for the zinc finger model.

    Data file 12
    Data File PS00028_alignments.out.
    Sequence alignments of PILGram model matches to the positive examples in the zinc finger model and a summary score line that represents the overlap of the 10 different models for each sequence.

    Data file 13
    Data File PILGram_PATTERNS_MDRpred.txt.
    The 36 regular expressions and associated physicochemical properties (where applicable) generated by PILGram for the MDR model.

    Data file 14
    Data File MDRpred_alignments.out.
    Alignments of 36 PILGram model matches on the MDR positive example sequences.

    Data file 15
    Data File Pfam_transporters.txt.
    A list of Pfam families that were used to identify transporters in the Hot Lake metagenome.

    Data file 16
    Data File HotLake_MDRpred_predictions.fasta.
    A FASTA format file of 63 protein sequences from the Hot Lake metagenome that are matched by 30 or more MDRpred individual models (high confidence predictions), match Pfam families for transporters (Pfam e-value less than 1e-20), but are not identified by Pfam as multidrug resistance transporters.

  16. Ensembl TSS dataset for GRCh38

    • zenodo.org
    • portalcienciaytecnologia.jcyl.es
    • +2more
    bin
    Updated Aug 26, 2024
    Cite
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough; the specific TSS position of each transcript is needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.

    First, we need an interface in order to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks will form our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, 700 bases before this codon and 300 bases after it are concatenated, getting the final sequence of 1003 nucleotides that is going to be used in our models. These specific window values have been used in (Bhandari et al., 2021) and we have kept them for comparison purposes.

    One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. In order to get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected the specific position, we take the 700 bases before it and the 300 bases after it, as we did with the positive instances.
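
The window arithmetic described above (700 + 3 + 300 = 1003 nucleotides) can be illustrated with a short Python sketch. It is not taken from the authors' scripts: the function names are hypothetical, positions are assumed to be 0-based indices into a plain sequence string, and strand orientation is ignored.

```python
import random

UPSTREAM = 700    # bases taken before the 3-base "codon"
DOWNSTREAM = 300  # bases taken after it
CODON = 3         # the position itself plus the 2 following bases


def extract_window(sequence, pos):
    """Return the 1003-nt window around pos, or None if it does not fit."""
    start = pos - UPSTREAM
    end = pos + CODON + DOWNSTREAM
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def sample_negatives(sequence, tss_pos, n, rng=None):
    """Pick n random non-TSS positions whose full window fits in the sequence."""
    rng = rng or random.Random()
    last_valid = len(sequence) - CODON - DOWNSTREAM
    candidates = [p for p in range(UPSTREAM, last_valid + 1) if p != tss_pos]
    picks = rng.sample(candidates, min(n, len(candidates)))
    return [extract_window(sequence, p) for p in picks]
```

Returning `None` for windows that run off either end mirrors the filtering of incomplete instances, but how the original pipeline treats such edge cases is an assumption here.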

    Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
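
The chromosome-based split can be expressed as a small lookup. The sketch below is illustrative only: `assign_split` is a hypothetical helper, chromosome names are assumed to be strings such as `"16"` or `"chr16"`, and every instance from the listed test chromosomes is assumed to go to the test set.

```python
VALIDATION_CHROMS = {"16"}
TEST_CHROMS = {"1", "3", "13", "19", "21"}


def assign_split(chrom):
    """Map a chromosome name to 'train', 'validation' or 'test'."""
    # Accept both "16" and "chr16" style names (an assumption about the input).
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name in VALIDATION_CHROMS:
        return "validation"
    if name in TEST_CHROMS:
        return "test"
    return "train"


# Group a few example instances, given as (chromosome, label) pairs.
instances = [("16", 1), ("chr3", 0), ("X", 1), ("2", 0)]
splits = {"train": [], "validation": [], "test": []}
for chrom, label in instances:
    splits[assign_split(chrom)].append((chrom, label))
```

Splitting by chromosome rather than by random row keeps overlapping windows from the same genomic region out of both the training and evaluation sets.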

  17. Additional file 1 of Biotite: new tools for a versatile Python...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Aug 13, 2024
    Cite
    Anter, Jacob Marcel; Krumbach, Jan Hendrik; Kunzmann, Patrick; Hamacher, Kay; Müller, Tom David; Greil, Maximilian; Bauer, Daniel; Islam, Faisal (2024). Additional file 1 of Biotite: new tools for a versatile Python bioinformatics library [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001475055
    Description

    Additional file 1. Application example Jupyter notebooks.

  18. Hemoglobin bioinformatics

    • qubeshub.org
    Updated Jun 7, 2021
    Keith Johnson (2021). Hemoglobin bioinformatics [Dataset]. http://doi.org/10.25334/MMEY-8321
    Explore at:
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    QUBES
    Authors
    Keith Johnson
    Description

    This is an introduction to bioinformatics using hemoglobin as an example. The worksheets introduce students to resources to explore the DNA, RNA and polypeptide linear structure with a brief introduction to the quaternary structure of hemoglobin.

  19. CNB Lipid Rafts reAnalysis. - This dataset represents an example for a...

    • data.niaid.nih.gov
    • ebi.ac.uk
    xml
    Updated Jul 14, 2014
    J. Alberto Medina-Aunon; J.P. Albar (2014). CNB Lipid Rafts reAnalysis. - This dataset represents an example for a textbook: Genomics and Proteomics for Clinical Discovery and Development [Dataset]. https://data.niaid.nih.gov/resources?id=pxd000744
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 14, 2014
    Dataset provided by
    Proteomics - Bioinformatics
    CNB/CSIC - ProteoRed
    Authors
    J. Alberto Medina-Aunon; J.P. Albar
    Variables measured
    Proteomics
    Description

    This dataset represents an example for the textbook Genomics and Proteomics for Clinical Discovery and Development (Translational Bioinformatics, Volume 6, 2014, pp. 41-68), http://www.springer.com/life+sciences/systems+biology+and+bioinformatics/book/978-94-017-9201-1. It is a reanalysis of the work "Proteomic analysis of apical microvillous membranes of syncytiotrophoblast cells reveals a high degree of similarity with lipid rafts". The experiment is described in Paradela et al., J Proteome Res. 2005 Nov-Dec;4(6):2435-41, PMID: 16335998. For protein annotations, see Medina-Aunon et al., Proteomics. 2010 Sep;10(18):3262-71, PMID: 20707001.

  20. CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)

    • data.mendeley.com
    • data.niaid.nih.gov
    • +3more
    Updated Dec 4, 2018
    Farah Zaib Khan (2018). CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object) [Dataset]. http://doi.org/10.17632/xnwncxpw42.1
    Explore at:
    Dataset updated
    Dec 4, 2018
    Authors
    Farah Zaib Khan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. The workflow has five steps in total:

    1. Read alignment using STAR which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
    2. The Genome BAM file is processed using Picard MarkDuplicates producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).
    3. SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
    4. The indexed BAM file is processed further with RNA-SeQC which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
    5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.
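
    The dependencies between the five steps above can be written down as a small graph. The Python sketch below is only an illustration of that ordering using the standard-library graphlib module (Python 3.9 or later); the step labels are shorthand chosen for this example, and it is not the CWL definition of the workflow.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
# Labels are shorthand for this sketch, not CWL step identifiers.
DEPENDENCIES = {
    "STAR": set(),
    "MarkDuplicates": {"STAR"},
    "samtools_index": {"MarkDuplicates"},
    "RNA-SeQC": {"samtools_index"},
    "RSEM": {"STAR"},
}


def execution_order(dependencies):
    """Return one valid order in which the steps could be run."""
    return list(TopologicalSorter(dependencies).static_order())
```

    Because RSEM depends only on STAR, a scheduler is free to run it alongside the MarkDuplicates, SAMtools index and RNA-SeQC chain, which is the parallelism mentioned in step 5.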

    For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public-access dataset. Chromosome 12 was extracted from the Homo sapiens assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, the use of containerization for the underlying software and the detailed documentation are important factors in choosing this specific CWL workflow for the CWLProv evaluation.

    This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

