100+ datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
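
    The three steps above can be sketched in plain Python. This is a minimal illustration, not the author's generation script: the description says Biopython was used for the properties, whereas this sketch computes only the mean Kyte-Doolittle hydrophobicity by hand. The uniform residue sampling, the fixed seed, and the `P00000`-style identifiers are assumptions made for the example.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Step 1: random amino acid string with length in [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (one property only): average Kyte-Doolittle hydropathy."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, protein_id):
    """Build one row holding a subset of the dataset's columns."""
    seq = random_protein(rng)
    return {
        "ID_Protein": protein_id,  # hypothetical ID format
        "Sequence": seq,
        "Hydrophobicity": mean_hydrophobicity(seq),
        "Sequence_Length": len(seq),
        "Class": rng.choice(CLASSES),  # Step 3: label assigned at random
    }


rng = random.Random(0)  # fixed seed is an assumption, for reproducibility
records = [make_record(rng, f"P{i:05d}") for i in range(5)]
```

    Because the class label is drawn independently of the sequence, a classifier trained on data built this way should not be expected to beat chance, which is consistent with the limitations listed for this dataset.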

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Bioinformatics Collection

    • kaggle.com
    zip
    Updated Jul 31, 2022
    Cite
    Andrey Shtrauss (2022). Bioinformatics Collection [Dataset]. https://www.kaggle.com/shtrausslearning/bioinformatics
    Available download formats: zip (155016771 bytes)
    Dataset updated
    Jul 31, 2022
    Authors
    Andrey Shtrauss
    Description

    Dataset

    This dataset was created by Andrey Shtrauss

    Contents

  3. Protocol data (R version)

    • figshare.com
    application/gzip
    Updated Oct 16, 2020
    Cite
    Jesse Gillis (2020). Protocol data (R version) [Dataset]. http://doi.org/10.6084/m9.figshare.13020569.v2
    Available download formats: application/gzip
    Dataset updated
    Oct 16, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesse Gillis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We published 3 protocols illustrating how MetaNeighbor can be used to quantify cell type replicability across single cell transcriptomic datasets. The data files included here are needed to run the R version of the protocols available on GitHub (https://github.com/gillislab/MetaNeighbor-Protocol) in RMarkdown (.Rmd) and Jupyter (.ipynb) notebook format. To run the protocols, download the protocols from GitHub, download the data from Figshare, place the data and protocol files in the same directory, then run the notebooks in RStudio or Jupyter. The scripts used to generate the data are included in the GitHub repository. Briefly:

    • full_biccn_hvg.rds: a single cell transcriptomic dataset published by the Brain Initiative Cell Census Network (in SingleCellExperiment format). It combines data from 7 datasets obtained in the mouse primary motor cortex (https://www.biorxiv.org/content/10.1101/2020.02.29.970558v2). Note that this dataset only contains highly variable genes.
    • biccn_hvgs.txt: highly variable genes from the BICCN dataset described above (computed with the MetaNeighbor library).
    • biccn_gaba.rds: same dataset as full_biccn_hvg.rds, but restricted to GABAergic neurons. The dataset contains all genes common to the 7 BICCN datasets (not just highly variable genes).
    • go_mouse.rds: gene ontology annotations, stored as a list of gene symbols (one element per gene set).
    • functional_aurocs.txt: results of the MetaNeighbor functional analysis in protocol 3.

  4. Data from: Semi-artificial datasets as a resource for validation of...

    • zenodo.org
    • search.dataone.org
    • +1more
    zip
    Updated Jun 5, 2022
    Cite
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart (2022). Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection [Dataset]. http://doi.org/10.5061/dryad.0zpc866z8
    Available download formats: zip
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge amounts of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of the data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes.

    Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a "real" HTS dataset spiked with artificial viral reads. They allow researchers to tune their pipelines and parameters so that the results approximate the actual viral composition of the semi-artificial datasets as closely as possible. Each semi-artificial dataset makes it possible to test one or several limitations that could prevent virus detection or correct virus identification from HTS data (i.e. low viral concentration, new viral species, non-complete genome).

    Eight artificial datasets only composed of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species showing different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset.

    A GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge) is available and provides a complete description of the composition of each dataset, the methods used to create them and their goals.

  5. Molecular Biology Databases Published in Nucleic Acids Research between...

    • databank.illinois.edu
    Updated Feb 1, 2024
    Cite
    Heidi Imker (2024). Molecular Biology Databases Published in Nucleic Acids Research between 1991-2016 [Dataset]. http://doi.org/10.13012/B2IDB-4311325_V1
    Dataset updated
    Feb 1, 2024
    Authors
    Heidi Imker
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.

  6. Data from: Use of long-read sequencing simulators to assess real-world...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Dec 2, 2025
    Cite
    Agricultural Research Service (2025). Data from: Use of long-read sequencing simulators to assess real-world applications for food safety [Dataset]. https://catalog.data.gov/dataset/data-from-use-of-long-read-sequencing-simulators-to-assess-real-world-applications-for-foo-35d38
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service
    Area covered
    World
    Description

    Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
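
    Since the simulated reads are distributed in fasta format, a small reader is often the first thing needed to test a pipeline against them. The following is a minimal sketch using only the standard library; the two-read example and its `read_1`/`read_2` headers are made up for illustration and do not reflect the actual NanoSim-H header layout.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-read example; real headers in the dataset will differ.
example = io.StringIO(">read_1\nACGTAC\nGTTA\n>read_2\nGGCC\n")
reads = list(read_fasta(example))
lengths = [len(seq) for _, seq in reads]
```

    For real files, the same generator can be fed an open file object, so read lengths can be summarized without loading the whole file into memory.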

  7. RMQS1 16S bioinformatic config files and control sample data

    • entrepot.recherche.data.gouv.fr
    application/gzip, tsv +1
    Updated Aug 22, 2024
    Cite
    Sébastien Terrat; Samuel Dequiedt (2024). RMQS1 16S bioinformatic config files and control sample data [Dataset]. http://doi.org/10.57745/XBFOJP
    Available download formats: tsv(522347), txt(143493), tsv(8814), tsv(33093), tsv(117004), application/gzip(362535), tsv(13212), tsv(32344), tsv(266094), tsv(80032), txt(10413), tsv(16460)
    Dataset updated
    Aug 22, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Sébastien Terrat; Samuel Dequiedt
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    French National Research Agency (ANR)
    France Génomique
    French Agency for Ecological Transition (ADEME)
    Description

    RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.

    Dataset: This dataset contains the config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance files of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of the 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.

    File structure:

    • Taxonomy files rmqs1_control_taxonomy_: Taxonomy is split across five files with one line per site and one column per taxon. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present: Unknown (not matching any reference), Unclassified (missing taxa between genus and phylum) and Environmental (matched to a sample from an environmental study, generally with only a phylum name).
    • rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTU, “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold).
    • rmqs1_16S_bank_association.tsv: two-column file with the bank name for each sample.
    • rmqs1_16S_bank_metadata.tsv: library_name: library name used in labs; study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier; library_name_genoscope: library name used in the Genoscope sequence center; MID: multiplex identifier sequence; run_alias: Genoscope internal alias; ftp_link: FTP link to download library.
    • Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
    • project_G4.tab: Comma separated file containing the information needed to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only: PROJECT: Project name chosen by the user; LIBRARY_NAME: Library name chosen by the user; LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE; SAMPLE_NAME: Sample name chosen by the user; MID_F: MID name or MID sequence associated to the Forward primer; MID_R: MID name or MID sequence associated to the Reverse primer; TARGET: Target gene (16S, 18S, or 23S); PRIMER_F: Forward primer name used for amplification; PRIMER_R: Reverse primer name used for amplification; SEQUENCE_PRIMER_F: Forward primer sequence used for amplification; SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification.
    • Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
    • project_GLOBAL.tab: Comma separated file containing the information needed to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline: PROJECT: Project name chosen by the user; LIBRARY_NAME: Library name chosen by the user; LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE; SAMPLE_NAME: Sample name chosen by the user; MID_F: MID name or MID sequence associated to the Forward primer; MID_R: MID name or MID sequence associated to the Reverse primer; TARGET: Target gene (16S, 18S, or 23S); PRIMER_F: Forward primer name used for amplification; PRIMER_R: Reverse primer name used for amplification; SEQUENCE_PRIMER_F: Forward primer sequence used for amplification; SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification.

    Details: Data from three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples can be present in several libraries. We kept only the one with the highest number of sequences.
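
    The description states that each line of the taxonomy and OTU abundance tables sums to 10k (the rarefaction threshold), which gives a cheap integrity check after download. The sketch below assumes a tab-separated layout with a header row, an identifier in the first column and integer counts in the remaining columns; that layout, and the toy table, are assumptions for illustration rather than a documented file specification.

```python
import csv
import io

RAREFACTION_DEPTH = 10_000  # "Each line sums to 10k" in the description


def rows_off_depth(handle, depth=RAREFACTION_DEPTH):
    """Return the IDs of rows whose counts do not sum to `depth`.

    Assumes a tab-separated table with a header row, an ID in the
    first column, and integer counts in every other column.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    bad = []
    for row in reader:
        if not row:
            continue
        if sum(int(value) for value in row[1:]) != depth:
            bad.append(row[0])
    return bad


# Toy table with hypothetical site IDs and two OTU columns.
toy = io.StringIO("site\tDB1\tOUT1\nA\t9000\t1000\nB\t9000\t999\n")
mismatches = rows_off_depth(toy)
```

    An empty result means every row matches the stated rarefaction depth; any listed ID points to a row worth inspecting before analysis.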

  8. Ensembl TSS dataset for GRCh38

    • zenodo.org
    • investigacion.ubu.es
    • +1more
    bin
    Updated Aug 26, 2024
    Cite
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
    Available download formats: bin
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough: the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.

    First, we need an interface to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering out instances that present null values in any relevant field, this combination of the sequence and its flanks forms our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated with it, giving the final sequence of 1003 nucleotides that is used in our models. These specific window values were used in (Bhandari et al., 2021), and we have kept them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. To get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected a specific position, we take the 700 bases before it and the 300 bases after it, as we did with the positive instances.
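
    The window arithmetic described above (700 bases, a 3-base codon starting at the chosen position, then 300 bases, for 1003 nucleotides in total) can be sketched as follows. This is an illustration under stated assumptions, not the authors' scripts: positions are treated as 0-based string indices, strand handling is ignored, windows that would run off the sequence are skipped, and the toy sequence and TSS index are invented.

```python
import random

UPSTREAM = 700    # bases taken before the codon
DOWNSTREAM = 300  # bases taken after the codon
CODON = 3         # the chosen position plus the 2 following bases


def extract_window(sequence, pos):
    """Return the 1003-nt window around 0-based `pos`, or None if the
    window would run past either end of `sequence`."""
    start = pos - UPSTREAM
    end = pos + CODON + DOWNSTREAM
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def sample_negative_positions(sequence, tss_positions, n, rng):
    """Pick up to n distinct non-TSS positions that allow a full window."""
    candidates = [
        p for p in range(UPSTREAM, len(sequence) - CODON - DOWNSTREAM + 1)
        if p not in tss_positions
    ]
    return rng.sample(candidates, min(n, len(candidates)))


rng = random.Random(0)            # fixed seed, for reproducibility only
toy_sequence = "ACGT" * 500       # 2,000-nt stand-in for a real region
tss = 800                         # hypothetical TSS index
positive = extract_window(toy_sequence, tss)
negatives = [
    extract_window(toy_sequence, p)
    for p in sample_negative_positions(toy_sequence, {tss}, 10, rng)
]
```

    Restricting candidate positions to those with a complete window keeps every instance at the same length, which is what a fixed-input model would need.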

    Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
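
    The chromosome-based split and the instance counts can be expressed compactly. In this sketch, chromosome names are assumed to be plain strings without a "chr" prefix; the authors' own scripts in the linked repository may represent them differently.

```python
VALIDATION_CHROMS = {"16"}
TEST_CHROMS = {"1", "3", "13", "19", "21"}


def assign_split(chromosome):
    """Map a chromosome name to 'validation', 'test', or 'train'."""
    if chromosome in VALIDATION_CHROMS:
        return "validation"
    if chromosome in TEST_CHROMS:
        return "test"
    return "train"  # every other chromosome is used for training


# Counts quoted in the description.
positives = 102_488
negatives = 1_019_625
total = positives + negatives
ratio = negatives / positives  # slightly under the nominal 10:1

splits = {name: assign_split(name) for name in ["1", "2", "16", "X"]}
```

    Splitting by chromosome rather than by random instance avoids placing overlapping windows from the same genomic region in both the training and evaluation sets.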

  9. End to End Bioinformatics Pipeline GSE57691

    • kaggle.com
    zip
    Updated Dec 8, 2025
    Cite
    Dr. Nagendra (2025). End to End Bioinformatics Pipeline GSE57691 [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/end-to-end-bioinformatics-pipeline-gse57691
    Available download formats: zip (6093346 bytes)
    Dataset updated
    Dec 8, 2025
    Authors
    Dr. Nagendra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    • This dataset provides a complete end-to-end bioinformatics workflow built around the GSE57691 gene expression dataset.
    • It includes all core steps typically used in transcriptomic data analysis for differential gene expression studies.
    • Raw and intermediate files are organized to demonstrate a reproducible and transparent pipeline structure.
    • The dataset covers data acquisition, preprocessing, normalization, and probe-to-gene annotation steps.
    • Quality control analyses such as boxplots, density plots, PCA plots, and sample clustering are included.
    • Differential expression analysis is performed using standard statistical methods suitable for microarray platforms.
    • Outputs include tables of significantly upregulated and downregulated genes with adjusted p-values and fold changes.
    • Multiple visualization assets are included to help interpret biological significance of detected expression changes.
    • Plots include volcano plots, heatmaps, MA plots, and exploratory QC figures.
    • The workflow demonstrates how to identify meaningful gene expression patterns between experimental groups.
    • The dataset is structured so that users can understand and replicate a complete analysis starting from raw data.
    • It is suitable for learners, researchers, or anyone wanting a practical reference for bioinformatics pipelines.
    • The files can be used for training, project demonstrations, or teaching reproducible data analysis principles.
    • This dataset provides clear examples of how to format results for downstream interpretation and publication use.
    • It showcases how biological insights can be extracted from the GSE57691 dataset using standard bioinformatics tools.
    • All generated outputs represent steps commonly used in modern transcriptomics research and analysis workflows.
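
    The description mentions adjusted p-values without naming the correction method. As one common choice in differential expression work, and purely as an assumption about what such a pipeline might use, the Benjamini-Hochberg adjustment can be sketched like this:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

    Whether the tables in this dataset were produced with this particular correction should be confirmed from the files themselves.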

  10. Data from: Supporting data for "Building a community-driven bioinformatics...

    • researchportal.scu.edu.au
    txt
    Updated Jul 5, 2025
    Cite
    Locedie Mansueto; Tobias Kretzschmar; Ramil, P Mauleon; Graham King (2025). Supporting data for "Building a community-driven bioinformatics platform to facilitate Cannabis sativa multi-omics research" [Dataset]. https://researchportal.scu.edu.au/esploro/outputs/dataset/Supporting-data-for-Building-a-community-driven/991013285435302368
    Available download formats: txt (23671 bytes)
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    GigaDB (http://gigadb.org/)
    Authors
    Locedie Mansueto; Tobias Kretzschmar; Ramil, P Mauleon; Graham King
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    2024
    Dataset funded by
    Australian Research Council (Australia, Canberra) - ARC
    Description

    Global changes in Cannabis legislation after decades of stringent regulation, and heightened demand for its industrial and medicinal applications, have spurred recent genetic and genomics research. An international research community emerged and identified the need for a web portal to host Cannabis-specific datasets that seamlessly integrates multiple data sources and serves omics-type analyses, fostering information sharing. The Tripal platform was used to host public genome assemblies, gene annotations, QTL and genetic maps, gene and protein expression, metabolic profiles and their sample attributes. SNPs were called using public resequencing datasets on three genomes. Additional applications, such as SNP-Seek and MapManJS, were embedded into Tripal. A multi-omics data integration web-service API, developed on top of existing Tripal modules, returns generic tables of sample, property, and values. Use-cases demonstrate the API's utility for various -omics analyses, enabling researchers to perform multi-omics analyses efficiently.

  11. Data from: DNA metabarcoding captures subtle differences in forest beetle...

    • researchdata.edu.au
    Updated 2020
    Cite
    Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu (2020). DNA metabarcoding captures subtle differences in forest beetle communities following disturbance [Dataset]. https://researchdata.edu.au/dna-metabarcoding-captures-following-disturbance/1676001
    Dataset updated
    2020
    Dataset provided by
    University of Tasmania, Australia
    Authors
    Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset includes all raw Miseq high-throughput sequencing data, bioinformatic pipeline and R codes that were used in the publication "Liu M, Baker SC, Burridge CP, Jordan GJ, Clarke LJ (2020) DNA metabarcoding captures subtle differences in forest beetle communities following disturbance. Restoration Ecology. 28:1475-1484. DOI:10.1111/rec.13236."

    • Miseq_16S.zip: Miseq sequencing dataset for gene marker 16S, including 48 fastq files for 24 beetle bulk samples.
    • Miseq_CO1.zip: Miseq sequencing dataset for gene marker CO1, including 46 fastq files for 23 beetle bulk samples (one sample failed to be sequenced).
    • nfp4MBC.nf: A nextflow bioinformatic script to process Miseq datasets.
    • nextflow.config: A configuration file needed when using nfp4MBC.nf.
    • adapters_16S.zip: Adapters used to tag each of 24 beetle bulk samples for 16S, also used to process the 16S Miseq dataset when using nfp4MBC.nf.
    • adapters_CO1.zip: Adapters used to tag each of 24 beetle bulk samples for CO1, also used to process the CO1 Miseq dataset when using nfp4MBC.nf.
    • rMBC.Rmd: R markdown codes for community analyses.
    • rMBC.zip: Datasets used in rMBC.Rmd.
    • COI_ZOTUs_176.fasta: DNA sequences of 176 COI ZOTUs.
    • 16S_ZOTUs_156: DNA sequences of 156 16S ZOTUs.

  12. Test dataset from: GenErode: a bioinformatics pipeline to investigate genome...

    • figshare.scilifelab.se
    • datasetcatalog.nlm.nih.gov
    • +2more
    application/x-gzip
    Updated Jan 15, 2025
    Cite
    Verena Kutschera; Marcin Kierczak; Tom van der Valk; Johanna von Seth; Nicolas Dussex; Edana Lord; Marianne Dehasque; David W. G. Stanton; Payam Emami Khoonsari; Björn Nystedt; Love Dalén; David Díez del molino (2025). Test dataset from: GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species [Dataset]. http://doi.org/10.17044/scilifelab.19248172.v2
    Available download formats: application/x-gzip
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    National Bioinformatics Infrastructure Sweden (Stockholm University & Science for Life Laboratory)
    Authors
    Verena Kutschera; Marcin Kierczak; Tom van der Valk; Johanna von Seth; Nicolas Dussex; Edana Lord; Marianne Dehasque; David W. G. Stanton; Payam Emami Khoonsari; Björn Nystedt; Love Dalén; David Díez del molino
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This item contains a test dataset based on Sumatran rhinoceros (Dicerorhinus sumatrensis) whole-genome re-sequencing data that we publish along with the GenErode pipeline (https://github.com/NBISweden/GenErode; Kutschera et al. 2022) and that we reduced in size so that users have the possibility to get familiar with the pipeline before analyzing their own genome-wide datasets. We extracted scaffold ‘Sc9M7eS_2_HRSCAF_41’ of size 40,842,778 bp from the Sumatran rhinoceros genome assembly (Dicerorhinus sumatrensis harrissoni; GenBank accession number GCA_014189135.1) to be used as reference genome in GenErode. Some GenErode steps require the reference genome of a closely related species, so we additionally provide three scaffolds from the White rhinoceros genome assembly (Ceratotherium simum simum; GenBank accession number GCF_000283155.1) with a combined length of 41,195,616 bp that are putatively orthologous to Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with gene predictions in GTF format. The repository also contains a Sumatran rhinoceros mitochondrial genome (GenBank accession number NC_012684.1) to be used as reference for the optional mitochondrial mapping step in GenErode. The test dataset contains whole-genome re-sequencing data from three historical and three modern Sumatran rhinoceros samples from the now-extinct Malay Peninsula population from von Seth et al. (2021) that was subsampled to paired-end reads that mapped to Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with a small proportion of randomly selected reads that mapped to the Sumatran rhinoceros mitochondrial genome or elsewhere in the genome. For GERP analyses, scaffolds from the genome assemblies of 30 mammalian outgroup species are provided that had reciprocal blast hits to gene predictions from Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’. 
    Further, a phylogeny of the White rhinoceros and the 30 outgroup species including divergence time estimates (in billions of years) from timetree.org is available. Finally, the item contains configuration and metadata files that were used for three separate runs of GenErode to generate the results presented in Kutschera et al. (2022). Bash scripts and a workflow description for the test dataset generation are available in the GenErode GitHub repository (https://github.com/NBISweden/GenErode/docs/extras/test_dataset_generation).

    References:

    Kutschera VE, Kierczak M, van der Valk T, von Seth J, Dussex N, Lord E, et al. GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species. BMC Bioinformatics 2022;23:228. https://doi.org/10.1186/s12859-022-04757-0

    von Seth J, Dussex N, Díez-Del-Molino D, van der Valk T, Kutschera VE, Kierczak M, et al. Genomic insights into the conservation status of the world’s last remaining Sumatran rhinoceros populations. Nature Communications 2021;12:2393.

  13. Gene Expression Analysis and Disease Relationship

    • kaggle.com
    zip
    Updated Aug 4, 2025
    Cite
    asel (2025). Gene Expression Analysis and Disease Relationship [Dataset]. https://www.kaggle.com/datasets/ylmzasel/gene-expression-analysis-and-disease-relationship/code
    Available download formats: zip (8740 bytes)
    Dataset updated
    Aug 4, 2025
    Authors
    asel
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises 1000 hypothetical patient or sample entries, each detailing gene expression profiles and relevant clinical characteristics. It includes a mix of both numerical and categorical data types, allowing for the application of diverse machine learning and statistical analysis methods.

    Column Descriptions:

    • PatientID (Categorical/Numerical): A unique identification number assigned to each patient.
    • Age (Numerical): The patient's age. Can be used to investigate potential correlations between age and gene expression profiles.
    • Gender (Categorical): The patient's gender (0: Female, 1: Male). Effects of gender on gene expression or disease status can be analyzed.
    • Gene_X_Expression (Numerical): The relative expression level of a specific gene, "Gene X". This represents a hypothetical gene that might play a role in disease progression or treatment response.
    • Gene_Y_Expression (Numerical): The relative expression level of another specific gene, "Gene Y". Can be studied in conjunction with or independently of Gene X.
    • SmokingStatus (Categorical): The patient's smoking status (0: Non-smoker, 1: Ex-smoker, 2: Current smoker). Environmental factors' impact on gene expression and disease can be assessed.
    • DiseaseStatus (Categorical): The patient's status for the target disease (0: Healthy, 1: Disease A, 2: Disease B). This can serve as the primary target variable for your predictive models.
    • TreatmentResponse (Categorical/Numerical): The degree of response to applied treatment (0: No Response, 1: Partial Response, 2: Full Response). The role of gene expression profiles in predicting treatment success can be explored.

    Use Cases and Potential Projects

    This dataset serves as an excellent starting point for students, researchers, and enthusiasts in bioinformatics, computational biology, data science, and machine learning, enabling various projects such as:

    • Disease Diagnosis/Classification: Building models to predict DiseaseStatus using gene expression levels and other clinical factors.
    • Treatment Response Prediction: Forecasting how patients with specific gene expression profiles might respond to treatment (TreatmentResponse).
    • Biomarker Discovery: Identifying gene expression levels (e.g., Gene_X_Expression, Gene_Y_Expression) that show strong correlations with disease or treatment response.
    • Feature Engineering and Selection: Evaluating the importance of various features in the dataset and creating new ones to enhance model performance.
    • Data Visualization: Generating visualizations to explore relationships between gene expression data and demographic/clinical factors.
    • Regression and Correlation Analyses: Quantitatively examining the effects of factors like age and smoking status on gene expression levels.

    Why Use This Dataset?

    • Privacy Secure: Being entirely synthetic, it carries no privacy or ethical concerns associated with real patient data.
    • Diversity: The mix of both numerical and categorical variables offers a rich ground for experimenting with different analytical techniques.
    • Predictive Potential: Clear target variables like DiseaseStatus and TreatmentResponse make it ideal for developing classification and regression models.
    • Educational and Learning: Perfect for applying fundamental data science and machine learning concepts for anyone interested in the bioinformatics domain.
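
    The integer codes listed in the column descriptions can be turned back into readable labels before plotting or modelling. The following is a minimal sketch, assuming the file is a plain CSV whose header uses the English column names given above; the two sample rows and the helper name decode_row are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Integer codes as documented in the column descriptions.
CODEBOOK = {
    "Gender": {0: "Female", 1: "Male"},
    "SmokingStatus": {0: "Non-smoker", 1: "Ex-smoker", 2: "Current smoker"},
    "DiseaseStatus": {0: "Healthy", 1: "Disease A", 2: "Disease B"},
    "TreatmentResponse": {0: "No Response", 1: "Partial Response", 2: "Full Response"},
}


def decode_row(row):
    """Return a copy of row with coded columns replaced by their labels."""
    decoded = dict(row)
    for column, labels in CODEBOOK.items():
        if column in decoded:
            decoded[column] = labels[int(decoded[column])]
    return decoded


# Hypothetical two-row sample; the values are invented for illustration.
SAMPLE_CSV = """PatientID,Age,Gender,Gene_X_Expression,Gene_Y_Expression,SmokingStatus,DiseaseStatus,TreatmentResponse
1,54,0,2.31,0.87,2,1,1
2,37,1,1.05,1.42,0,0,2
"""

rows = [decode_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for r in rows:
    print(r["PatientID"], r["Gender"], r["SmokingStatus"], r["DiseaseStatus"], r["TreatmentResponse"])
```

    To run this against the downloaded file, replace the in-memory sample with an open file handle. If the real header differs from the names above, CODEBOOK needs to be adjusted accordingly.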

  14. Dataset from:An Evaluation of Large Language Models in Bioinformatics...

    • zenodo.org
    zip
    Updated Jul 25, 2025
    Cite
    Hengchuang Yin; Lun Hu (2025). Dataset from:An Evaluation of Large Language Models in Bioinformatics Research [Dataset]. http://doi.org/10.5281/zenodo.16419266
    Available download formats: zip
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Hengchuang Yin; Lun Hu
    Description

    This repository contains the data and code to reproduce the results of our paper: An Evaluation of Large Language Models in Bioinformatics Research

    Authors: Hengchuang Yin; Lun Hu

    Abstract: Large language models, such as the GPT series, have revolutionized natural language processing by demonstrating strong capabilities in text generation and reasoning. However, their potential in the field of bioinformatics, characterized by complex biological data and specialized knowledge, has not been fully evaluated. In this study, we systematically assess the performance of multiple advanced and widely used LLMs on six diverse bioinformatics tasks: drug-drug interaction prediction, antimicrobial and anticancer peptide identification, molecular optimization, gene and protein named entity recognition, single-cell type annotation, and bioinformatics problem solving.
    Our experimental results demonstrate that, with appropriate prompt design and limited task-specific fine-tuning, general-purpose LLMs can achieve competitive or even superior performance compared to traditional models that require extensive computational resources and technical design across various tasks. Our analysis further uncovers the current limitations of LLMs in handling structurally complex and knowledge-intensive bioinformatics problems. Overall, this study demonstrates the broad prospects of LLMs in bioinformatics while emphasizing their limitations, providing valuable insights for future research at the intersection of LLMs and bioinformatics.

    Section_A_ddi:

    ddinter_positive_samples.csv: Positive drug–drug interaction (DDI) pairs curated from the DDInter database.

    ddinter_negative_samples.csv: Negative drug–drug interaction (DDI) pairs (no known interactions), used for supervised classification.

    drug_description_embeddings_all-mpnet-base-v2.npy: Drug description embeddings generated using the all-mpnet-base-v2 model.

    drug_description_embeddings_bge-large-en-v1.5.npy: Drug description embeddings generated using the bge-large-en-v1.5 model.

    drug_description_embeddings_e5-small-v2.npy: Drug description embeddings generated using the e5-small-v2 model.

    drug_description_embeddings_gtr-t5-large.npy: Drug description embeddings generated using the gtr-t5-large model.

    drug_description_embeddings_text_embedding_3_large.npy: Drug description embeddings generated using OpenAI's text-embedding-3-large model.

    drug_description_embeddings_text_embedding_3_small.npy: Drug description embeddings generated using OpenAI's text-embedding-3-small model.

    drug_description_embeddings_text_embedding_ada_002.npy: Drug description embeddings generated using OpenAI's text-embedding-ada-002 model.
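
    For supervised classification, the positive and negative pair files would typically be merged into one labelled table. The sketch below shows one way to do that with the standard library. The column names drug_a and drug_b, the drug identifiers, and the helper read_pairs are hypothetical, since the actual CSV layout is not described here.

```python
import csv
import io


def read_pairs(handle, label):
    """Read every row from a CSV handle and attach a class label to it."""
    return [dict(row, label=label) for row in csv.DictReader(handle)]


# Hypothetical stand-ins for ddinter_positive_samples.csv and
# ddinter_negative_samples.csv; the real column names may differ.
positive_csv = io.StringIO("drug_a,drug_b\nD1,D2\nD3,D4\n")
negative_csv = io.StringIO("drug_a,drug_b\nD1,D5\n")

# Label 1 marks a known interaction, label 0 marks no known interaction.
pairs = read_pairs(positive_csv, 1) + read_pairs(negative_csv, 0)
n_pos = sum(p["label"] for p in pairs)
print(f"{len(pairs)} pairs, {n_pos} positive, {len(pairs) - n_pos} negative")
```

    With the real files, the io.StringIO objects would be replaced by open file handles.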

    We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.

    Contact

    If you have any questions, please don't hesitate to ask me: yinhengchuang@ms.xjb.ac.cn or hulun@ms.xjb.ac.cn

  15. Data used in exercises in course Introduction to Data Management Practices

    • figshare.scilifelab.se
    • researchdata.se
    zip
    Updated Jan 15, 2025
    Cite
    Yvonne Kallberg; Elin Kronander; Niclas Jareborg; Markus Englund; Wolmar Nyberg Åkerström (2025). Data used in exercises in course Introduction to Data Management Practices [Dataset]. http://doi.org/10.17044/scilifelab.14301317.v3
    Available download formats: zip
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Uppsala University
    Authors
    Yvonne Kallberg; Elin Kronander; Niclas Jareborg; Markus Englund; Wolmar Nyberg Åkerström
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the data files used in exercises in the NBIS course "Introduction to Data Management Practices".

  16. Paired omics Data Platform projects

    • narcis.nl
    • datasetcatalog.nlm.nih.gov
    • +3more
    Updated 2020
    Cite
    Verhoeven, Stefan; Schorn, Michelle; Medema, Marnix H.; Dorrestein, Pieter C.; van der Hooft, Justin (2020). Paired omics Data Platform projects [Dataset]. http://doi.org/10.5281/zenodo.3736430
    Dataset updated
    2020
    Dataset provided by
    Netherlands eScience Center
    Authors
    Verhoeven, Stefan; Schorn, Michelle; Medema, Marnix H.; Dorrestein, Pieter C.; van der Hooft, Justin
    Description

    The Paired Omics Data Platform is a community-based initiative standardizing links between genomic and metabolomics data in a computer readable format to further the field of natural products discovery. The goals are to link molecules to their producers, find large scale genome-metabolome associations, use genomic data to assist in structural elucidation of molecules, and provide a centralized database for paired datasets. This dataset contains the projects in http://pairedomicsdata.bioinformatics.nl/. The JSON documents adhere to the http://pairedomicsdata.bioinformatics.nl/schema.json JSON schema.

  17. AMtb_1_202402 pod5 data

    • figshare.unimelb.edu.au
    bin
    Updated Apr 2, 2024
    + more versions
    Cite
    Michael Hall (2024). AMtb_1_202402 pod5 data [Dataset]. http://doi.org/10.26188/25495045.v1
    Available download formats: bin
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    Michael Hall
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the pod5 data from an Oxford Nanopore Technologies sequencing run using an R10.4.1 flowcell (FLO-MIN114). There may be pod5 files from multiple runs, but those in this archive are those deemed assigned to this sample from dorado sup basecalling demultiplexing. As such, no demultiplexing is required.

    Sample information:

    • Sample alias: AMtb_1_202402
    • BioSample: SAMN40453083
    • Species: Mycobacterium tuberculosis

    For further information about the study, refer to the linked manuscript. For details of the code and associated data used for the analysis, refer to the GitHub repository.

  18. metabarcoding data for: Benchmark of bioinformatics tools for fast and...

    • search.dataone.org
    • datadryad.org
    Updated May 17, 2025
    Cite
    Laetitia Mathon (2025). metabarcoding data for: Benchmark of bioinformatics tools for fast and accurate species identification from environmental DNA metabarcoding [Dataset]. http://doi.org/10.5061/dryad.15dv41nx6
    Dataset updated
    May 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Laetitia Mathon
    Time period covered
    Jan 1, 2021
    Description

    This dataset contains fish DNA sequence samples, simulated with Grinder to build a mock community, as well as real fish eDNA metabarcoding data from the Mediterranean Sea.

    These data have been used to compare the efficiency of different bioinformatic tools in retrieving the species composition of real and simulated samples.

  19. Dataset: The potential of genome-wide RAD sequences for resolving rapid...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jul 28, 2020
    Cite
    Khan, Gulzar; Zappi, Daniela C.; Franco, Fernando Faria; Ribolla, Paulo Eduardo Martins; Taylor, Nigel; Silva, Gislaine Angélica Rodrigues; Amaral, Danilo Trabuco; Moraes, Evandro Marsola; da Silva Andrade, Sónia Cristina; Eaton, Deren A. R.; Alonso, Diego Peres; Bombonato, Juliana Rodrigues (2020). Dataset: The potential of genome-wide RAD sequences for resolving rapid radiations: a case study in Cactaceae [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000568613
    Dataset updated
    Jul 28, 2020
    Authors
    Khan, Gulzar; Zappi, Daniela C.; Franco, Fernando Faria; Ribolla, Paulo Eduardo Martins; Taylor, Nigel; Silva, Gislaine Angélica Rodrigues; Amaral, Danilo Trabuco; Moraes, Evandro Marsola; da Silva Andrade, Sónia Cristina; Eaton, Deren A. R.; Alonso, Diego Peres; Bombonato, Juliana Rodrigues
    Description

    The reconstruction of relationships within recently radiated groups is challenging even when massive amounts of sequencing data are available. The use of restriction site-associated DNA sequencing (RAD-Seq) to this end is promising. Here, we assessed the performance of RAD-Seq to infer the species-level phylogeny of the rapidly radiating genus Cereus (Cactaceae). To examine how the amount of genomic data affects resolution in this group, we used distinct datasets and implemented different analyses. We sampled 52 individuals of Cereus, representing 18 of the 25 species currently recognized, plus members of the closely allied genera Cipocereus and Praecereus, and other 11 Cactaceae genera as outgroups. Three scenarios of permissiveness to missing data were carried out in iPyRAD, assembling datasets with 30% (333 loci), 45% (1440 loci), and 70% (6141 loci) of missing data. For each dataset, Maximum Likelihood (ML) trees were generated using two supermatrices, i.e., only SNPs and SNPs plus invariant sites. Accuracy and resolution were improved when the dataset with the highest number of loci was used (6141 loci), despite the high percentage of missing data included (70%). Coalescent trees estimated using SVDQuartets and ASTRAL are similar to those obtained by the ML reconstructions. Overall, we reconstruct a well-supported phylogeny of Cereus, which is resolved as monophyletic and composed of four main clades with high support in their internal relationships. Our findings also provide insights into the impact of missing data for phylogeny reconstruction using RAD loci.

    Sampling

    Our dataset includes 63 samples spanning 52 ingroups of Cereus and 11 outgroups (Table 1).

    ddRAD library preparation and sequencing

    Genomic DNA was extracted from root tissues using the DNeasy Plant Mini Kit (Qiagen). ddRAD libraries were prepared using high fidelity EcoRI and HPAII restriction enzymes following Campos et al. (2017) and Khan et al. (2019). Details of library preparation and sequencing are shown in Supplementary material.

    Bioinformatics analyses

    Raw data were trimmed for adapters and quality filtered before SNPs calling. The quality of sequencing data was checked with FastQC 0.11.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc), visualized in MultiQC 1.0 (https://github.com/ewels/MultiQC), and filtered with SeqyClean 1.9.12 (Zhbannikov et al., 2017) using the following settings: minimum quality (Phred Score 20), minimum size (>65 bp), and Illumina contaminants (UniVec.fas). We used the iPyRAD pipeline (available at http://github.com/dereneaton/ipyrad) to identify homology among reads, make SNP calls, and format output files. The following parameter settings were implemented: mindepth_majrule = 6 (minimum depth for majority-rule base calling), clust_threshold = 0.85 (clustering threshold for de novo assembly), filter_adapters = 2 (strict filter), max_Hs_consens = 6 (maximum heterozygotes in consensus), min_samples_locus (minimum percentage of samples per locus for output). For the latter, values varied in three distinct scenarios concerning the permissiveness to missing data. These scenarios considered that the final set of loci should have at least 39 samples (scenario 1, approximately 30% of missing data), 26 samples (scenario 2, approximately 45% of missing data), or 13 samples (scenario 3, approximately 70% of missing data). After SNP calling, CD-HIT (Li and Godzik, 2006; Fu et al., 2012) was used to identify reverse-complement duplicates in the loci recovered by iPyRAD.

  20. Data_Sheet_1_GUAVA: A Graphical User Interface for the Analysis and...

    • datasetcatalog.nlm.nih.gov
    Updated Jul 17, 2018
    + more versions
    Cite
    Cheung, Edwin; Divate, Mayur (2018). Data_Sheet_1_GUAVA: A Graphical User Interface for the Analysis and Visualization of ATAC-seq Data.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000662843
    Dataset updated
    Jul 17, 2018
    Authors
    Cheung, Edwin; Divate, Mayur
    Description

    Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) is a powerful genomic technology that is used for the global mapping and analysis of open chromatin regions. However, to process and analyze such data, users must either combine a number of complicated bioinformatic tools or use the currently available ATAC-seq analysis software, which is not very user friendly and lacks visualization of the ATAC-seq results. Because of these issues, biologists with minimal bioinformatics background who wish to process and analyze their own ATAC-seq data will find these tasks difficult and will ultimately need to seek help from bioinformatics experts. Moreover, none of the available tools provides a complete solution for ATAC-seq data analysis. Therefore, to enable non-programming researchers to analyze ATAC-seq data on their own, we developed a tool called Graphical User interface for the Analysis and Visualization of ATAC-seq data (GUAVA). GUAVA is standalone software that provides users with a seamless solution from beginning to end, including adapter trimming, read mapping, the identification and differential analysis of ATAC-seq peaks, functional annotation, and the visualization of ATAC-seq results. We believe GUAVA will be a highly useful and time-saving tool for analyzing ATAC-seq data for biologists with minimal or no bioinformatics background. Since GUAVA can also operate through the command line, it can easily be integrated into existing pipelines, thus providing flexibility to users with computational experience.

Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated


Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License

MIT License https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

  • ID_Protein: Unique identifier for each protein.
  • Sequence: String of amino acids.
  • Molecular_Weight: Molecular weight calculated from the sequence.
  • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
  • Hydrophobicity: Average hydrophobicity calculated from the sequence.
  • Total_Charge: Sum of the charges of the amino acids in the sequence.
  • Polar_Proportion: Percentage of polar amino acids in the sequence.
  • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
  • Sequence_Length: Total number of amino acids in the sequence.
  • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

  • UniProt: A comprehensive database of protein sequences and annotations.
  • Kyte-Doolittle Scale: Calculations of hydrophobicity.
  • Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
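
The three steps above can be sketched with the Python standard library alone. This is an illustrative approximation, not the author's generation script: the dataset description says Biopython was used, whereas this sketch averages a hand-written Kyte-Doolittle table. Uniform sampling over the 20 standard amino acids, the fixed seed, and the helper names random_protein and mean_hydrophobicity are all assumptions.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(sorted(KYTE_DOOLITTLE))
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain of 50 to 300 residues (uniform sampling assumed)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2: average Kyte-Doolittle hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


rng = random.Random(0)  # fixed seed, chosen here only for reproducibility
sequence = random_protein(rng)
record = {
    "Sequence": sequence,
    "Sequence_Length": len(sequence),
    "Hydrophobicity": round(mean_hydrophobicity(sequence), 3),
    "Class": rng.choice(CLASSES),  # Step 3: class assigned at random
}
print(record)
```

Because the class is drawn independently of the sequence in this sketch, a classifier trained on data generated this way should not be expected to beat chance, which is consistent with the limitations listed below.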

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
