100+ datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats
    zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
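
The three generation steps above can be sketched in plain Python. This is a minimal illustration, not the author's script: it uses only the standard library rather than Biopython, fills in just a few of the listed columns (Sequence_Length and a Kyte-Doolittle average for Hydrophobicity), and the function names and the seed are hypothetical.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain of 50 to 300 residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (partial): average Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, protein_id):
    """Build one row; step 3 assigns the class at random."""
    sequence = random_protein(rng)
    return {
        "ID_Protein": protein_id,
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed (an assumption) for reproducibility
records = [make_record(rng, i) for i in range(5)]
```

Because the class is drawn independently of the sequence, a classifier trained on such data should not be expected to beat chance, which is consistent with the limitations the dataset description lists.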

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. MINUTE-ChIP example data

    • figshare.scilifelab.se
    txt
    Updated Jan 15, 2025
    + more versions
    Cite
    Carmen Navarro Luzon; Simon Elsässer (2025). MINUTE-ChIP example data [Dataset]. http://doi.org/10.17044/scilifelab.25348405.v1
    Available download formats
    txt
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Carmen Navarro Luzon; Simon Elsässer
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This collection contains an example MINUTE-ChIP dataset on which to run the minute pipeline. It is provided as supporting material to help users understand the results of a MINUTE-ChIP experiment, from raw data to a primary analysis that yields the relevant files for downstream analysis along with summarized QC indicators. The example primary non-demultiplexed FASTQ files provided here were used with the minute pipeline to generate GSM5493452-GSM5493463 (H3K27me3) and GSM5823907-GSM5823918 (Input), deposited together on GEO under series GSE181241. For more information about MINUTE-ChIP, see the publication relevant to this dataset: Kumar, Banushree, et al. "Polycomb repressive complex 2 shields naïve human pluripotent cells from trophectoderm differentiation." Nature Cell Biology 24.6 (2022): 845-857. For more information about the minute pipeline, there are a public bioRxiv preprint, a GitHub repository, and official documentation.

  3. Example dataset annotated with bacannot

    • figshare.com
    application/x-gzip
    Updated Jun 11, 2023
    Cite
    Felipe Almeida (2023). Example dataset annotated with bacannot [Dataset]. http://doi.org/10.6084/m9.figshare.14160590.v1
    Available download formats
    application/x-gzip
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Felipe Almeida
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This dataset is the PacBio dataset from https://github.com/PacificBiosciences/DevNet/wiki/8-plex-Ecoli-Multiplexed-Microbial-Assembly, already assembled and annotated, so that users who want to skip some steps of the tutorial can do so by downloading it.

  4. Example dataset for the CommPath R package tutorial

    • figshare.com
    application/gzip
    Updated May 30, 2023
    Cite
    Lu (2023). Example dataset for the CommPath R package tutorial [Dataset]. http://doi.org/10.6084/m9.figshare.19090553.v5
    Available download formats
    application/gzip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the example dataset used in the CommPath R package tutorial. We downloaded the processed scRNA-seq data on hepatocellular carcinoma (HCC) samples from the paper of Sharma, et al., 2020, and then randomly selected the expression data of 3000 cells across the top 5000 highly variable genes from the tumor and normal tissues, respectively.

  5. Examples datasets for Microbiology

    • zenodo.org
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Tristan Cordier (2020). Examples datasets for Microbiology [Dataset]. http://doi.org/10.5281/zenodo.2605445
    Available download formats
    zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tristan Cordier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three example datasets for performing bioinformatics analysis.

  6. Example datasets for AlphaCRV

    • zenodo.org
    application/gzip, bin
    Updated Jan 8, 2024
    Cite
    Francisco J. Guzmán-Vega (2024). Example datasets for AlphaCRV [Dataset]. http://doi.org/10.5281/zenodo.10470744
    Available download formats
    bin, application/gzip
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco J. Guzmán-Vega
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Time period covered
    Jan 8, 2024
    Description

    Example datasets for AlphaCRV: A Pipeline for Identifying Accurate Binder Topologies in Mass-Modeling with AlphaFold. With these datasets you can replicate two of the three examples shown in the paper, following along with the Jupyter notebooks in the GitHub repository at https://github.com/strubelab/AlphaCRV. Description:

    • AVRPia.fasta, AVRPik.fasta, SKP1.fasta: Sequences of the bait molecules for the three examples.
    • AVRPia_binders.fasta, AVRPik_binders.fasta, SKP1_binders.fasta: Sequences of the binder molecules for the three examples.
    • AVRPia_vs_rice_models.tar.lzma, AVRPik_vs_rice_models.tar.lzma: Compressed archives of the AlphaFold-Multimer models for the AVRPia and AVRPik examples. The 712 complexes for SKP1 are more than 100GB in size, so those can be provided upon request. To decompress the .tar.lzma archives use the following two commands on each:
    unlzma AVRPia_vs_rice_models.tar.lzma
    tar -xvf AVRPia_vs_rice_models.tar
    • AVRPia_vs_rice_clusters.tar.gz, AVRPik_vs_rice_clusters.tar.gz, SKP1_vs_rice_clusters.tar.gz: Compressed archives with the results from running AlphaCRV on the three examples presented in the paper. To decompress these archives use the following command on each:
    tar -xvzf AVRPia_vs_rice_clusters.tar.gz
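
Where the `unlzma` tool is not available, the same two-step decompression of a .tar.lzma archive can be sketched with Python's standard library. This is an assumed alternative and not part of the AlphaCRV instructions; the helper name `extract_tar_lzma` and the destination directory are hypothetical.

```python
import lzma
import tarfile


def extract_tar_lzma(archive_path, dest_dir):
    """Decompress a .tar.lzma archive and unpack it into dest_dir.

    Returns the list of member names found in the archive.
    """
    # lzma.open auto-detects the legacy .lzma container as well as .xz.
    with lzma.open(archive_path, "rb") as stream:
        # mode "r:" reads an uncompressed tar from the decompressed stream.
        with tarfile.open(fileobj=stream, mode="r:") as tar:
            tar.extractall(dest_dir)
            return tar.getnames()
```

A call such as `extract_tar_lzma("AVRPia_vs_rice_models.tar.lzma", "models")` would then unpack the archive into a `models` directory without writing the intermediate .tar file.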
  7. DNA Classification dataset

    • kaggle.com
    Updated Jul 29, 2025
    Cite
    Arif Miah (2025). DNA Classification dataset [Dataset]. https://www.kaggle.com/datasets/miadul/dna-classification-dataset
    Available download formats
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 3,000 synthetic DNA samples with 13 features designed for genomic data analysis, machine learning, and bioinformatics research. Each row represents a unique DNA sample with both sequence-level and statistical attributes.

    🔹 Dataset Structure

    Rows: 3,000

    Columns: 13

    🔹 Features Description

    1. Sample_ID → Unique identifier for each DNA sample

    2. Sequence → DNA sequence (string of A, T, C, G)

    3. GC_Content → Percentage of Guanine (G) and Cytosine (C) in the sequence

    4. AT_Content → Percentage of Adenine (A) and Thymine (T) in the sequence

    5. Sequence_Length → Total sequence length

    6. Num_A → Number of Adenine bases

    7. Num_T → Number of Thymine bases

    8. Num_C → Number of Cytosine bases

    9. Num_G → Number of Guanine bases

    10. kmer_3_freq → Average 3-mer (triplet) frequency score

    11. Mutation_Flag → Binary flag indicating mutation presence (0 = No, 1 = Yes)

    12. Class_Label → Class of the sample (Human, Bacteria, Virus, Plant)

    13. Disease_Risk → Risk level associated with the sample (Low / Medium / High)
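
Several of the sequence-level features listed above can be derived directly from the Sequence column. The following is a sketch under assumptions: the description does not say how the published values were computed, so the function name `sequence_features` and the 0 to 100 percentage convention are illustrative, and kmer_3_freq is left out because its definition is not given.

```python
from collections import Counter


def sequence_features(sequence):
    """Compute base counts and GC/AT percentages for one DNA sequence."""
    seq = sequence.upper()
    counts = Counter(seq)
    length = len(seq)
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    return {
        "Sequence_Length": length,
        "Num_A": counts["A"],
        "Num_T": counts["T"],
        "Num_C": counts["C"],
        "Num_G": counts["G"],
        # Percentages on a 0-100 scale (an assumption); 0.0 for empty input.
        "GC_Content": 100.0 * gc / length if length else 0.0,
        "AT_Content": 100.0 * at / length if length else 0.0,
    }


features = sequence_features("ATGCGC")
```

Recomputing these columns from the raw sequence is also a quick consistency check on the dataset before using it for modelling.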

    🔹 Potential Use Cases

    DNA classification tasks (e.g., predicting species from DNA sequence features)

    Exploratory Data Analysis (EDA) in bioinformatics

    Machine Learning model development (Logistic Regression, Random Forest, SVM, Neural Networks)

    Deep Learning approaches (LSTM, CNN, Transformers for sequence learning)

    Mutation detection and disease risk analysis

    Teaching and practicing biological data preprocessing techniques

    🔹 Why This Dataset?

    Synthetic but realistic structure, inspired by genomics data

    Balanced and diverse distribution of features and labels

    Suitable for beginners and researchers to practice classification, visualization, and model comparison

    🔹 Example Research Questions

    Can we classify DNA samples into their biological class using sequence-based features?

    How does GC content relate to mutation risk?

    Which ML model performs best for DNA classification tasks?

    Can synthetic DNA features predict disease risk categories?

    📌 Acknowledgment

    This dataset is synthetic and generated for educational & research purposes. It does not represent real patient data.

  8. Example dataset for singleCellHaystack GitHub vignettes (2020)

    • figshare.com
    bin
    Updated May 30, 2023
    Cite
    Alexis Vandenbon (2023). Example dataset for singleCellHaystack GitHub vignettes (2020) [Dataset]. http://doi.org/10.6084/m9.figshare.12923489.v4
    Available download formats
    bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexis Vandenbon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An example single-cell dataset in .rda format for use in the vignettes of singleCellHaystack on GitHub. The individual datasets included in the .rda file are also provided as .csv files.

  9. Ensembl TSS dataset for GRCh38

    • zenodo.org
    • portalcienciaytecnologia.jcyl.es
    • +2more
    bin
    Updated Aug 26, 2024
    Cite
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
    Available download formats
    bin
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data on which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough: the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instance processing, negative instance generation, and data splitting by chromosome.

    First, we need an interface to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering out instances that present null values in any relevant field, this combination of the sequence and its flanks forms our raw dataset. Once the sequences are available, we take the TSS position (given by Ensembl) and the 2 following bases and treat them as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated with it, giving the final sequence of 1003 nucleotides that is used in our models. These specific window values were used in (Bhandari et al., 2021), and we have kept them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. To get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected a specific position, we take the 700 bases before it and the 300 bases after it, as we did with the positive instances.

    Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
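
The window construction and negative sampling described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code (their scripts are in the linked repository): it works on a single in-memory sequence with 0-based positions, and the names `extract_window` and `sample_negative_positions` are hypothetical.

```python
import random

UPSTREAM, CODON, DOWNSTREAM = 700, 3, 300
WINDOW = UPSTREAM + CODON + DOWNSTREAM  # 1003 nucleotides


def extract_window(sequence, position):
    """Return 700 bases before, the 3-base codon at position, and 300 after.

    Returns None when the window would run past either end of the sequence.
    """
    start = position - UPSTREAM
    end = position + CODON + DOWNSTREAM
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def sample_negative_positions(rng, sequence_length, tss_position, n=10):
    """Pick up to n random positions, other than the TSS, with a full window."""
    last_valid = sequence_length - CODON - DOWNSTREAM
    candidates = [
        p for p in range(UPSTREAM, last_valid + 1) if p != tss_position
    ]
    return rng.sample(candidates, min(n, len(candidates)))


# Toy example on a random sequence; real inputs would come from Ensembl.
rng = random.Random(0)
toy = "".join(rng.choice("ACGT") for _ in range(3000))
positive = extract_window(toy, 1000)
negatives = [
    extract_window(toy, p)
    for p in sample_negative_positions(rng, len(toy), 1000)
]
```

Each positive window is paired here with 10 negative windows, which mirrors the 10:1 ratio quoted in the text.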

  10. Bioinformatics repository examples with good practices of using GitHub.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 13, 2016
    Cite
    Pollard, Tom J.; da Veiga Leprevost, Felipe; Vizcaíno, Juan Antonio; Flight, Robert M.; Eglen, Stephen J.; Ternent, Tobias; Blin, Kai; Perez-Riverol, Yasset; Gatto, Laurent; Konovalov, Alexander; Wang, Rui; Uszkoreit, Julian; Sachsenberg, Timo; Fufezan, Christian; Katz, Daniel S. (2016). Bioinformatics repository examples with good practices of using GitHub. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001509925
    Dataset updated
    Oct 13, 2016
    Authors
    Pollard, Tom J.; da Veiga Leprevost, Felipe; Vizcaíno, Juan Antonio; Flight, Robert M.; Eglen, Stephen J.; Ternent, Tobias; Blin, Kai; Perez-Riverol, Yasset; Gatto, Laurent; Konovalov, Alexander; Wang, Rui; Uszkoreit, Julian; Sachsenberg, Timo; Fufezan, Christian; Katz, Daniel S.
    Description

    The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/.

  11. Data from: Advancing computational biology and bioinformatics research...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Sep 27, 2019
    Cite
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
    Dataset updated
    Sep 27, 2019
    Authors
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
    Description

    Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that led to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.

  12. Metagenomes example datasets

    • figshare.com
    application/gzip
    Updated Jan 24, 2022
    Cite
    Kelly Hidalgo (2022). Metagenomes example datasets [Dataset]. http://doi.org/10.6084/m9.figshare.19015058.v1
    Available download formats
    application/gzip
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kelly Hidalgo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example dataset for bioinformatic tutorials

  13. Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 11, 2014
    Cite
    Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng (2014). “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001209841
    Dataset updated
    Dec 11, 2014
    Authors
    Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng
    Description

    “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education

  14. glycation

    • kaggle.com
    zip
    Updated Sep 20, 2020
    Cite
    Husnul Abid (2020). glycation [Dataset]. https://www.kaggle.com/husnulabid02/glycation
    Available download formats
    zip (2271419 bytes)
    Dataset updated
    Sep 20, 2020
    Authors
    Husnul Abid
    Description

    A bioinformatics dataset for determining whether a lysine (residue 'K') is glycated or not. All lysine positions given in 'Accession' are positive sites; every other lysine in each sequence is a negative site. You can use a window size of 15 (7 residues upstream, 'K' in the middle, 7 downstream).


    Example after pre-processing:

    | Sequence | label |
    | --- | --- |
    | AKSHPPDKWAQGAGA | 1 |
    | GILPILVKCLERDDN | 1 |
    | PSSNPHAKPSDFHFL | 0 |
    | EEVFYAVKVLQKKAI | 0 |

    There are around 6557 positive sites and 174023 negative sites, so the dataset is heavily imbalanced and prediction is difficult. You can apply CD-HIT to reduce the number of negative sites. To extract features, you can use CKSAAP or PSI-BLAST. This is a bioinformatics problem rather than a general one, so domain knowledge, for example about protein structure, is helpful.
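
The 15-residue windowing described above can be sketched as follows. This is a minimal illustration under assumptions, not the dataset author's preprocessing: positions are taken to be 1-based, lysines too close to either end to have a full window are simply skipped (padding would be another option), and the function name `lysine_windows` and the toy sequence are hypothetical.

```python
def lysine_windows(sequence, positive_positions, flank=7):
    """Return (window, label) pairs for every lysine with a full window.

    positive_positions holds 1-based positions of glycated lysines
    (the 1-based convention is an assumption).
    """
    positives = set(positive_positions)
    rows = []
    for index, residue in enumerate(sequence):
        if residue != "K":
            continue
        start, end = index - flank, index + flank + 1
        if start < 0 or end > len(sequence):
            continue  # not enough flanking residues for a 15-mer
        label = 1 if (index + 1) in positives else 0
        rows.append((sequence[start:end], label))
    return rows


# Toy sequence: the lysine at position 8 is marked as a positive site.
rows = lysine_windows("AKSHPPDKWAQGAGAKLLLLLLLL", positive_positions=[8])
```

With roughly 27 negative windows per positive one in the full dataset, class weighting or undersampling would likely be needed on top of this step.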

    Dataset link: http://plmd.biocuckoo.org/download.php

  15. Example dataset preprocessed with ngs-preprocess

    • figshare.com
    application/x-gzip
    Updated Jun 11, 2023
    Cite
    Felipe Almeida (2023). Example dataset preprocessed with ngs-preprocess [Dataset]. http://doi.org/10.6084/m9.figshare.14160641.v1
    Available download formats
    application/x-gzip
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Felipe Almeida
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the PacBio dataset from https://github.com/PacificBiosciences/DevNet/wiki/8-plex-Ecoli-Multiplexed-Microbial-Assembly, already demultiplexed and preprocessed in FASTQ format, so that users who want to skip some steps of the tutorial can do so by downloading it.

  16. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats
    zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network (http://www.hptn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies". Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz).
Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. 
By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
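
The comparison of matched sUMI and dUMI consensus sequences described above could look roughly like the following. This is a minimal sketch and not the authors' compare_seqs.py: it assumes that matched sequences share the same FASTA header in both files, and the function names and example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def discordant_ids(sumi, dumi):
    """Return names present in both collections whose sequences differ."""
    shared = sumi.keys() & dumi.keys()
    return sorted(name for name in shared if sumi[name] != dumi[name])


# Hypothetical example records; real inputs would be the two fasta files.
sumi_text = ">fam1\nACGTACGT\n>fam2\nACGTTCGT\n>fam3\nGGGG\n"
dumi_text = ">fam1\nACGTACGT\n>fam2\nACGTACGT\n"

mismatches = discordant_ids(parse_fasta(sumi_text), parse_fasta(dumi_text))
print(mismatches)
```

Families reported this way would then be candidates for the manual alignment step described in the text.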

  17. CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)

    • data.mendeley.com
    • data.niaid.nih.gov
    • +3more
    Updated Dec 4, 2018
    Farah Zaib Khan (2018). CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object) [Dataset]. http://doi.org/10.17632/xnwncxpw42.1
    Dataset updated
    Dec 4, 2018
    Authors
    Farah Zaib Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. The workflow has five steps in total:

    1. Read alignment using STAR which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
    2. The Genome BAM file is processed using Picard MarkDuplicates producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).
    3. SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
    4. The indexed BAM file is processed further with RNA-SeQC which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
    5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.
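
The dependencies between the five steps above can be sketched as a small graph. This is an illustrative sketch only, not the CWL workflow itself: the step labels are hypothetical names chosen here, and the ordering is derived with Python's standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
# Step names are hypothetical labels for the tools named in the list above.
dependencies = {
    "star_align": set(),
    "picard_markduplicates": {"star_align"},
    "samtools_index": {"picard_markduplicates"},
    "rnaseqc": {"samtools_index"},
    "rsem": {"star_align"},  # depends only on the STAR output
}


def execution_order(deps):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because `rsem` depends only on `star_align`, a scheduler is free to run it alongside the MarkDuplicates, index, and RNA-SeQC chain, which matches the "in parallel" wording of step 5.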

    For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public access dataset. Chromosome 12 was extracted from the Homo sapiens Assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, the use of containerization for the underlying software, and detailed documentation are important factors in choosing this specific CWL workflow for CWLProv evaluation.

    This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

  18. Principles and steps for integrating bioinformatics

    • data.mendeley.com
    Updated Aug 7, 2024
    Hang Thi Nguyen (2024). Principles and steps for integrating bioinformatics [Dataset]. http://doi.org/10.17632/wjx5h7wh22.3
    Dataset updated
    Aug 7, 2024
    Authors
    Hang Thi Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Biological data is growing rapidly and generating a vast amount of knowledge, while the updating of knowledge in teaching is limited and classroom time remains unchanged. Integrating bioinformatics into teaching is therefore an effective way to teach biology today. However, the big challenge is that pedagogical university students have yet to learn the basic knowledge and skills of bioinformatics, so they have difficulty and confusion when using it in biology teaching. This dataset includes survey results on high school teachers, teacher training curriculums, and pedagogical students in Vietnam. The highlights of this dataset are six basic principles and four steps of bioinformatics integration in teaching biology at high schools, with illustrative examples. The principles and approaches of integrating bioinformatics into biology teaching improve the quality of biology teaching and promote STEM education in Vietnam and developing countries.

  19. CNB Lipid Rafts reAnalysis. - This dataset represents an example for a...

    • ebi.ac.uk
    • data.niaid.nih.gov
    J. Alberto Medina-Aunon, CNB Lipid Rafts reAnalysis. - This dataset represents an example for a textbook: Genomics and Proteomics for Clinical Discovery and Development [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD000744
    Authors
    J. Alberto Medina-Aunon
    Variables measured
    Proteomics
    Description

    This dataset represents an example for a textbook: Genomics and Proteomics for Clinical Discovery and Development, http://www.springer.com/life+sciences/systems+biology+and+bioinformatics/book/978-94-017-9201-1 ReAnalysis of the work "Proteomic analysis of apical microvillous membranes of syncytiotrophoblast cells reveals a high degree of similarity with lipid rafts". Experiment described in: Paradela et al. J Proteome Res. 2005 Nov-Dec;4(6):2435-41, PMID: 16335998. For protein annotations, see Medina-Aunon et al. Proteomics. 2010 Sep;10(18):3262-71, PMID: 20707001. Book chapter: Genomics and Proteomics for Clinical Discovery and Development, Translational Bioinformatics Volume 6, 2014, pp 41-68.

  20. Sample scRNA-seq Data for Cell Type Annotation

    • mllmcelltype.com
    csv, xls
    Updated Jan 8, 2025
    mLLMCelltype Research Team (2025). Sample scRNA-seq Data for Cell Type Annotation [Dataset]. http://doi.org/10.5281/zenodo.mllmcelltype-sample
    Available download formats: csv, xls
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    mLLMCelltype
    Authors
    mLLMCelltype Research Team
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    2024 - 2025
    Area covered
    Global
    Variables measured
    Gene Expression, Cell Type Markers
    Measurement technique
    Single-cell RNA sequencing (scRNA-seq)
    Description

    Comprehensive example single-cell RNA sequencing dataset with marker genes specifically designed for testing and demonstrating AI-powered cell type annotation capabilities. This dataset includes representative cell clusters with known markers for validation purposes.

Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated

Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

  • ID_Protein: Unique identifier for each protein.
  • Sequence: String of amino acids.
  • Molecular_Weight: Molecular weight calculated from the sequence.
  • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
  • Hydrophobicity: Average hydrophobicity calculated from the sequence.
  • Total_Charge: Sum of the charges of the amino acids in the sequence.
  • Polar_Proportion: Percentage of polar amino acids in the sequence.
  • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
  • Sequence_Length: Total number of amino acids in the sequence.
  • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

  • UniProt: A comprehensive database of protein sequences and annotations.
  • Kyte-Doolittle Scale: Calculations of hydrophobicity.
  • Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
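
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's generation script: residues are drawn uniformly from the 20 standard amino acids, the `PROT_` ID format is hypothetical, and the average hydrophobicity is computed directly from the Kyte-Doolittle scale rather than through Biopython so that the example has no third-party dependencies.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def make_protein(rng, index):
    """Build one synthetic record with a subset of the dataset's columns."""
    # Step 1: random sequence with a length between 50 and 300 residues.
    length = rng.randint(50, 300)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: one example property, the mean Kyte-Doolittle hydropathy.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    # Step 3: class label assigned at random, independent of the sequence.
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": round(hydrophobicity, 3),
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
proteins = [make_protein(rng, i) for i in range(5)]
for protein in proteins:
    print(protein["ID_Protein"], protein["Sequence_Length"], protein["Class"])
```

Because the class label is drawn independently of the sequence, a classifier trained on data generated this way should not be expected to beat chance, which is consistent with the limitations listed for this dataset.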

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).
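
An 80/20 split with these sizes could be produced along the following lines. This is a sketch under assumptions: the source does not say whether the rows were shuffled or how they were assigned, and the function name, seed, and use of row indices in place of real records are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and return (train, test) lists of the requested sizes."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 20,000 protein records.
train, test = train_test_split(range(20000))
print(len(train), len(test))
```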

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
