100+ datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
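
    As a minimal sketch, the column list above could be checked when the data is loaded. The code below uses only the Python standard library. The sample row is invented for illustration and does not come from the dataset, and the header spelling is assumed to match the column names exactly as listed.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "ID_Protein", "Sequence", "Molecular_Weight", "Isoelectric_Point",
    "Hydrophobicity", "Total_Charge", "Polar_Proportion",
    "Nonpolar_Proportion", "Sequence_Length", "Class",
]
# The five functional classes named in the description.
EXPECTED_CLASSES = {"Enzyme", "Transport", "Structural", "Receptor", "Other"}


def check_rows(handle):
    """Return a list of problems found in a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["Class"] not in EXPECTED_CLASSES:
            problems.append(f"line {line_no}: unknown class {row['Class']!r}")
        if int(float(row["Sequence_Length"])) != len(row["Sequence"]):
            problems.append(f"line {line_no}: Sequence_Length mismatch")
    return problems


# Hypothetical sample row, invented for illustration only.
SAMPLE = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "P00001,MKTAYIAKQR,1200.0,9.5,-0.5,2,50.0,50.0,10,Enzyme\n"
)
print(check_rows(io.StringIO(SAMPLE)))
```

    To check the real files, pass an open file handle for the CSV in place of the in-memory sample.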

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
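
    The three steps above can be sketched roughly as follows. This is not the author's script: the description says Biopython was used for the property calculations, whereas this sketch substitutes a plain-Python mean of Kyte-Doolittle hydropathy values so that it runs without extra dependencies. The uniform residue sampling, the ID format, and the function names are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def random_protein(rng, min_len=50, max_len=300):
    """Step 1: random chain with a length between 50 and 300 residues."""
    length = rng.randint(min_len, max_len)  # both bounds inclusive
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(seq):
    """Step 2 (stand-in): average Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def make_record(rng, index):
    seq = random_protein(rng)
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": seq,
        "Sequence_Length": len(seq),
        "Hydrophobicity": mean_hydrophobicity(seq),
        "Class": rng.choice(CLASSES),  # Step 3: label independent of sequence
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [make_record(rng, i) for i in range(5)]
for rec in records:
    print(rec["ID_Protein"], rec["Sequence_Length"], rec["Class"])
```

    Because the labels in this sketch are drawn independently of the sequences, a classifier trained on such data should not be expected to do better than chance, which is worth keeping in mind when reading accuracy figures.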

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Dataset from: An Evaluation of Large Language Models in Bioinformatics...

    • zenodo.org
    zip
    Updated Jul 25, 2025
    Cite
    Hengchuang Yin; Lun Hu (2025). Dataset from: An Evaluation of Large Language Models in Bioinformatics Research [Dataset]. http://doi.org/10.5281/zenodo.16419266
    Available download formats: zip
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hengchuang Yin; Lun Hu
    Description

    This repository contains the data and code to reproduce the results of our paper: An Evaluation of Large Language Models in Bioinformatics Research

    Authors: Hengchuang Yin; Lun Hu

    Abstract: Large language models, such as the GPT series, have revolutionized natural language processing by demonstrating strong capabilities in text generation and reasoning. However, their potential in the field of bioinformatics, characterized by complex biological data and specialized knowledge, has not been fully evaluated. In this study, we systematically assess the performance of multiple advanced and widely used LLMs on six diverse bioinformatics tasks: drug-drug interaction prediction, antimicrobial and anticancer peptide identification, molecular optimization, gene and protein named entity recognition, single-cell type annotation, and bioinformatics problem solving.
    Our experimental results demonstrate that, with appropriate prompt design and limited task-specific fine-tuning, general-purpose LLMs can achieve competitive or even superior performance compared to traditional models that require extensive computational resources and technical design across various tasks. Our analysis further uncovers the current limitations of LLMs in handling structurally complex and knowledge-intensive bioinformatics problems. Overall, this study demonstrates the broad prospects of LLMs in bioinformatics while emphasizing their limitations, providing valuable insights for future research at the intersection of LLMs and bioinformatics.

    Section_A_ddi:

    ddinter_positive_samples.csv: Positive drug–drug interaction (DDI) pairs curated from the DDInter database.

    ddinter_negative_samples.csv: Negative drug–drug interaction (DDI) pairs (no known interactions), used for supervised classification.

    drug_description_embeddings_all-mpnet-base-v2.npy: Drug description embeddings generated using the all-mpnet-base-v2 model.

    drug_description_embeddings_bge-large-en-v1.5.npy: Drug description embeddings generated using the bge-large-en-v1.5 model.

    drug_description_embeddings_e5-small-v2.npy: Drug description embeddings generated using the e5-small-v2 model.

    drug_description_embeddings_gtr-t5-large.npy: Drug description embeddings generated using the gtr-t5-large model.

    drug_description_embeddings_text_embedding_3_large.npy: Drug description embeddings generated using OpenAI's text-embedding-3-large model.

    drug_description_embeddings_text_embedding_3_small.npy: Drug description embeddings generated using OpenAI's text-embedding-3-small model.

    drug_description_embeddings_text_embedding_ada_002.npy: Drug description embeddings generated using OpenAI's text-embedding-ada-002 model.
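
    As a rough sketch of how the files in this section might fit together for supervised classification, the code below merges positive and negative pairs into one labeled list and builds a feature vector for a pair by concatenating two drug embeddings. The column names `drug_a` and `drug_b`, the drug names, and the tiny two-dimensional embeddings are all invented stand-ins. The real CSV layout and the way the .npy rows map to drugs are not described here, and concatenation is only one common choice of pair feature.

```python
import csv
import io


def load_pairs(handle, label):
    """Read drug pairs from a CSV file-like object and attach a class label."""
    reader = csv.DictReader(handle)
    # "drug_a" and "drug_b" are hypothetical column names.
    return [(row["drug_a"], row["drug_b"], label) for row in reader]


def pair_features(embeddings, drug_a, drug_b):
    """Concatenate the two drugs' embedding vectors into one feature vector."""
    return embeddings[drug_a] + embeddings[drug_b]


# Invented stand-ins for ddinter_positive_samples.csv and
# ddinter_negative_samples.csv.
POSITIVE_CSV = "drug_a,drug_b\nDrugA,DrugB\nDrugC,DrugD\n"
NEGATIVE_CSV = "drug_a,drug_b\nDrugA,DrugD\n"

# Invented stand-in for one of the description-embedding arrays.
EMBEDDINGS = {
    "DrugA": [0.1, 0.2],
    "DrugB": [0.3, 0.4],
    "DrugC": [0.5, 0.6],
    "DrugD": [0.7, 0.8],
}

dataset = (
    load_pairs(io.StringIO(POSITIVE_CSV), 1)
    + load_pairs(io.StringIO(NEGATIVE_CSV), 0)
)
features = [pair_features(EMBEDDINGS, a, b) for a, b, _ in dataset]
labels = [label for _, _, label in dataset]
print(features, labels)
```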

    We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.

    Contact

    If you have any questions, please don't hesitate to ask me: yinhengchuang@ms.xjb.ac.cn or hulun@ms.xjb.ac.cn

  3. Data from: Semi-artificial datasets as a resource for validation of...

    • search.dataone.org
    • zenodo.org
    • +1 more
    Updated May 21, 2025
    Cite
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart (2025). Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection [Dataset]. http://doi.org/10.5061/dryad.0zpc866z8
    Dataset updated
    May 21, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart
    Time period covered
    Jan 1, 2021
    Description

    In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge amounts of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes.

    Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a “real” HTS dataset spiked with artificial viral reads. It will allow researchers to adjust ...

  4. BOriS Training Datasets

    • figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    Theodor Sperlea (2023). BOriS Training Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.8108357.v1
    Available download formats: txt
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Theodor Sperlea
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Positive and negative datasets created for training machine learning models to classify/identify gammaproteobacterial oriC sequences.

  5. Data from: A large-scale analysis of bioinformatics code on GitHub

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 31, 2018
    Cite
    Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas (2018). A large-scale analysis of bioinformatics code on GitHub [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000639408
    Dataset updated
    Oct 31, 2018
    Authors
    Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas
    Description

    In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
    Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.

  6. Scorpio Gene-Taxa Benchmark Dataset

    • data.niaid.nih.gov
    • resodate.org
    • +1 more
    Updated Apr 3, 2025
    Cite
    Refahi, Mohammad Saleh (2025). Scorpio Gene-Taxa Benchmark Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12175912
    Dataset updated
    Apr 3, 2025
    Dataset provided by
    Drexel University
    Authors
    Refahi, Mohammad Saleh
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.

    To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.

    We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) with different samples but the same genera and genes as the training set; a Taxa_out_set drawn from 18 phyla held out of the training set (genes seen in training, phyla unseen); and a Gene_out_set drawn from 60 genes held out of the training set (phyla seen in training, genes unseen). We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.
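
    One reading of the split described above is sketched below: whole phyla are held out to form the taxa-out set, whole genes are held out to form the gene-out set, and everything else remains available for the train/test split. The record fields, the toy values, and the decision to drop records that are both in a held-out phylum and of a held-out gene are assumptions for illustration, not details taken from the dataset.

```python
def split_records(records, held_out_phyla, held_out_genes):
    """Partition records into in-distribution, taxa-out, and gene-out groups."""
    in_dist, taxa_out, gene_out = [], [], []
    for rec in records:
        if rec["phylum"] in held_out_phyla:
            # Unseen phylum: keep only genes that training will see.
            if rec["gene"] not in held_out_genes:
                taxa_out.append(rec)
        elif rec["gene"] in held_out_genes:
            # Seen phylum, unseen gene.
            gene_out.append(rec)
        else:
            # Seen phylum and seen gene: later divided into train and test.
            in_dist.append(rec)
    return in_dist, taxa_out, gene_out


# Toy records with hypothetical field names and values.
records = [
    {"id": "s1", "gene": "geneA", "phylum": "P1"},
    {"id": "s2", "gene": "geneB", "phylum": "P1"},
    {"id": "s3", "gene": "geneA", "phylum": "P2"},
    {"id": "s4", "gene": "geneB", "phylum": "P2"},
]
in_dist, taxa_out, gene_out = split_records(
    records, held_out_phyla={"P2"}, held_out_genes={"geneB"}
)
print([r["id"] for r in in_dist])
print([r["id"] for r in taxa_out])
print([r["id"] for r in gene_out])
```

    Holding out whole phyla or whole genes, instead of random rows, is what lets the two extra sets probe generalization to unseen taxa and unseen genes separately.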

  7. Data from: EukProt: a database of genome-scale predicted proteins across the...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    bin
    Updated May 31, 2023
    Cite
    Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas (2023). EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotes [Dataset]. http://doi.org/10.6084/m9.figshare.12417881.v3
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Version 3 (22 November, 2021)

    See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): A selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above).

    Scroll to the end of this page for changes since version 2.

    Are we missing anything? Please let us know!

    EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

    This release contains 5 files:

    EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).

    EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.

    EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. With the following columns:

    EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.

    Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.

    Strain: the strain(s) of the species sequenced.

    Previous_Names: any previous names that this species was known by.

    Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).

    Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).

    Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).

    Merged_Strains: whether multiple strains of the same species were merged to create the data set.

    Data_Source_URL: the URL(s) from which the data were downloaded.

    Data_Source_Name: the name of the data set (as assigned by the data source).

    Paper_DOI: the DOI(s) of the paper(s) that published the data set.

    Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details):

    • ‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/
    • ‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/
    • ‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/
    • ‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/
    • ‘gffread’: v. 0.12.3, https://github.com/gpertea/gffread
    • ‘predict genes’: EukMetaSanity, https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021)

    All parameter values were default, unless otherwise specified.

    Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).

    Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).

    Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.

    Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.

    18S_Sequence_GenBank_ID: GenBank identifier for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that all available sequences are not full-length (< 1,500 bp).

    18S_Sequence: 18S for the strain derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.

    18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.

    18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.

    18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.

    18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.
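
    The table conventions described above (tab-delimited, comma-delimited multiple entries within a cell, “N/A” for missing data) could be read along these lines. The sample rows and identifiers are invented. Which columns hold multiple entries is left to the caller here, because free-text columns such as Notes may contain ordinary commas.

```python
import csv
import io


def parse_cell(value, multi):
    """Map 'N/A' to None and split multi-entry cells on commas."""
    if value == "N/A":
        return None
    if multi:
        return [part.strip() for part in value.split(",")]
    return value


def read_table(handle, multi_columns):
    """Read a tab-delimited table into a list of dicts."""
    # QUOTE_NONE keeps quote characters in cells as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {col: parse_cell(val, col in multi_columns) for col, val in row.items()}
        for row in reader
    ]


# Invented sample using a few of the documented column names.
SAMPLE = (
    "EukProt_ID\tName_to_Use\tStrain\tData_Source_Type\n"
    "EP_X1\tGenus_species\tstrainA, strainB\tgenome\n"
    "EP_X2\tOther_species\tN/A\ttranscriptome\n"
)
rows = read_table(io.StringIO(SAMPLE), multi_columns={"Strain"})
for row in rows:
    print(row["EukProt_ID"], row["Strain"])
```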

    Changes since version 2

    There are 324 new data sets included. 57 of these replace data sets from version 2.

    40 newly published data sets were added to the list that are not included in the database (annotated in the Notes column with the reasons they were not included).

    Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).

    All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.

    In the UniEuk_Taxonomy field, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).

    Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.

    The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.

    In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.

    EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.

    Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:

    [EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]

    Type abbreviations are P (protein) and T (transcriptome).

    All characters not in the following list are removed from nucleic acid sequences: ACGTNUKSYMWRBDHV

    All characters not in the following list are removed from protein sequences: ABCDEFGHIKLMNPQRSTUVWYZX*

    Lists of legal characters are from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
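
    The identifier format and character filtering described above can be illustrated with a short sketch. It assumes that the EukProt ID itself contains no underscore and that the counter is made only of digits. The example header is invented. No case folding is applied, since the description only states which characters are kept.

```python
import re

NUCLEOTIDE_CHARS = set("ACGTNUKSYMWRBDHV")
PROTEIN_CHARS = set("ABCDEFGHIKLMNPQRSTUVWYZX*")

# [EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]
HEADER_RE = re.compile(
    r"^(?P<eukprot_id>[^_\s]+)_(?P<name>\S+)_(?P<type>[PT])(?P<counter>\d+)"
    r"(?:\s+(?P<previous>.*))?$"
)


def parse_header(header):
    """Split a FASTA header line into its documented parts."""
    match = HEADER_RE.match(header.lstrip(">").strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return match.groupdict()


def sanitize(sequence, legal_chars):
    """Drop every character that is not in the legal set."""
    return "".join(ch for ch in sequence if ch in legal_chars)


# Invented header for illustration only.
parsed = parse_header(">EPX_Genus_species_P000123 old header text")
print(parsed)
print(sanitize("MKT-LL?*", PROTEIN_CHARS))
print(sanitize("ACGT-NX", NUCLEOTIDE_CHARS))
```

    The name group is matched greedily, so species names that contain underscores stay in one piece and only the final `_P…` or `_T…` suffix of the first token is read as the type and counter.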

  8. Bioinformatics Collection

    • kaggle.com
    zip
    Updated Jul 31, 2022
    Cite
    Andrey Shtrauss (2022). Bioinformatics Collection [Dataset]. https://www.kaggle.com/shtrausslearning/bioinformatics
    Available download formats: zip (155016771 bytes)
    Dataset updated
    Jul 31, 2022
    Authors
    Andrey Shtrauss
    Description

    Dataset

    This dataset was created by Andrey Shtrauss


  9. Sample DNA Sequence

    • kaggle.com
    zip
    Updated Jan 14, 2021
    Cite
    Sreshta Putchala (2021). Sample DNA Sequence [Dataset]. https://www.kaggle.com/sreshta140/covid19-genome-sequence
    Available download formats: zip (69652 bytes)
    Dataset updated
    Jan 14, 2021
    Authors
    Sreshta Putchala
    Description

    Dataset

    This dataset was created by Sreshta Putchala


  10. Dataset on the transcriptome of the mangrove oyster Crassostrea gasar

    • data.mendeley.com
    Updated Jun 26, 2023
    + more versions
    Cite
    Clarissa Pellegrini Ferreira (2023). Dataset on the transcriptome of the mangrove oyster Crassostrea gasar [Dataset]. http://doi.org/10.17632/krnyfshh82.2
    Dataset updated
    Jun 26, 2023
    Authors
    Clarissa Pellegrini Ferreira
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The growing accessibility of NGS technologies has led to the rapid development of innovative bioinformatics tools to analyze sequence data and populate databases with new findings. However, current bioinformatics workflows focus on data from emerging technologies, whereas tools dedicated to earlier sequencing methods (i.e., Roche 454 GS FLX+) have become deprecated. We therefore revisited the pyrosequenced C. gasar raw reads and found that they contain valuable information that had not yet been explored. This dataset compares different bioinformatics tools and their ability to generate informative transcriptomes from Roche 454 read data of gills and digestive glands of the oyster C. gasar previously challenged with environmental pollutants, with the aim of retrieving a more comprehensive transcriptome. The Trinotate pipeline (http://trinotate.github.io) was used to annotate protein function, gene ontology, and pathway assignment of identified open reading frames (ORFs). Nucleotide and protein sequences were searched against the NCBI nr and UniProtKB/Swiss-Prot (uniprot_sprot.trinotate_v2.0.pep.gz) databases using NCBI BLASTx and BLASTp v2.10.0 (e-value 1e-10, -max_target_seqs 1, -outfmt 6), respectively. In this novel transcriptome, we identified genes related to Zn distribution in cells (Zn transporters ZIP and ZnT), metallothioneins (MTI and MTIV), Ca2+ transporters (NCX and ATP2B), and Cu distribution in cells (ATP7, ATOX1, CCS, and laccase-like).
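
    The `-outfmt 6` option mentioned above makes BLAST+ write tab-separated hits with twelve default columns. A minimal sketch of reading such output and applying an e-value cutoff is given below. The two sample rows are invented and are not results from this dataset, and the function name is hypothetical.

```python
import csv
import io

# The twelve default columns of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_hits(handle, max_evalue=1e-10):
    """Parse tabular BLAST hits and keep those at or below the cutoff."""
    hits = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        hit = dict(zip(OUTFMT6_COLUMNS, fields))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Invented sample rows for illustration only.
SAMPLE = (
    "contig1\tsubjectA\t85.0\t120\t18\t0\t1\t360\t10\t129\t3e-25\t110\n"
    "contig2\tsubjectB\t40.0\t60\t36\t0\t1\t180\t5\t64\t1e-05\t35.0\n"
)
hits = read_hits(io.StringIO(SAMPLE))
print([(h["qseqid"], h["sseqid"], h["evalue"]) for h in hits])
```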

  11. Simulated metagenomic dataset for Smith et al. 2022

    • dtechtive.com
    • find.data.gov.scot
    txt
    Updated Apr 22, 2022
    Cite
    University of Edinburgh. The Roslin Institute (2022). Simulated metagenomic dataset for Smith et al. 2022 [Dataset]. http://doi.org/10.7488/ds/3444
    Available download formats: txt (0.0166 MB)
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    University of Edinburgh. The Roslin Institute
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    UNITED KINGDOM
    Description

    This dataset is simulated metagenomic data created by Rebecca (Becky) Smith, PhD student at the Roslin Institute in Mick Watson's group. The data is described in detail in Smith et al. 2022; briefly, the reads were simulated using InSilicoSeq (https://doi.org/10.1093/bioinformatics/bty630) with the hiseq exponential model, and 150 bp. The genomes used to create this data are from the Hungate Collection (paper at https://www.nature.com/articles/nbt.4110 and sequences at https://genome.jgi.doe.gov/portal/HungateCollection/HungateCollection.info.html).

  12. References and test datasets for the Cactus pipeline

    • figshare.scilifelab.se
    • researchdata.se
    txt
    Updated Jan 15, 2025
    Cite
    Jerome Salignon; Lluis Milan Arino; Maxime Garcia; Christian Riedel (2025). References and test datasets for the Cactus pipeline [Dataset]. http://doi.org/10.17044/scilifelab.20171347.v4
    Available download formats: txt
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Jerome Salignon; Lluis Milan Arino; Maxime Garcia; Christian Riedel
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This item contains references and test datasets for the Cactus pipeline. Cactus (Chromatin ACcessibility and Transcriptomics Unification Software) is an mRNA-Seq and ATAC-Seq analysis pipeline that aims to provide advanced molecular insights on the conditions under study.

    Test datasets

    The test datasets contain all data needed to run Cactus in each of the 4 supported organisms. This includes ATAC-Seq and mRNA-Seq data (*.fastq.gz), parameter files (*.yml) and design files (*.tsv). They were created for each species by downloading publicly available datasets with fetchngs (Ewels et al., 2020) and subsampling reads to the minimum required to have enough DAS (Differential Analysis Subsets) for enrichment analysis. Datasets downloaded:

    • Worm and Human: GSE98758
    • Fly: GSE149339
    • Mouse: GSE193393

    References

    One of the goals of Cactus is to make the analysis as simple and fast as possible for the user while providing detailed insights on molecular mechanisms. This is achieved by parsing all needed references for the 4 ENCODE (Dunham et al., 2012; Stamatoyannopoulos et al., 2012; Luo et al., 2020) and modENCODE (THE MODENCODE CONSORTIUM et al., 2010; Gerstein et al., 2010) organisms (human, M. musculus, D. melanogaster and C. elegans). This parsing step was done with a Nextflow pipeline with most tools encapsulated within containers for improved efficiency and reproducibility and to allow the creation of customized references. Genomic sequences and annotations were downloaded from Ensembl (Cunningham et al., 2022). The ENCODE API (Luo et al., 2020) was used to download the ChIP-Seq profiles of 2,714 Transcription Factors (TFs) (Landt et al., 2012; Boyle et al., 2014) and chromatin states in the form of 899 ChromHMM profiles (Boix et al., 2021; van der Velde et al., 2021) and 6 HiHMM profiles (Ho et al., 2014). Slim annotations (cell, organ, development, and system) were parsed and used to create groups of ChIP-Seq profiles that share the same annotations, allowing users to analyze only ChIP-Seq profiles relevant to their study. 2,779 TF motifs were obtained from the Cis-BP database (Lambert et al., 2019). GO terms and KEGG pathways were obtained via the R packages AnnotationHub (Morgan and Shepherd, 2021) and clusterProfiler (Yu et al., 2012; Wu et al., 2021), respectively.

    Documentation

    More information on how to use Cactus and how references and test datasets were created is available on the documentation website: https://github.com/jsalignon/cactus.

  13. Surface eDNA samples of eukaryote micro- and mesoplankton along the 110°E...

    • obis.org
    • gbif.org
    • +1more
    zip
    Updated Nov 30, 2023
    + more versions
    Cite
    CSIRO National Collections and Marine Infrastructure (2023). Surface eDNA samples of eukaryote micro- and mesoplankton along the 110°E meridian (Second International Indian Ocean Expedition (IIOE-2) expedition) RV Investigator voyage IN2019_V03 (2019) [Dataset]. http://doi.org/10.1016/j.dsr2.2022.105178
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    CSIRO (https://www.csiro.au/)
    Authors
    CSIRO National Collections and Marine Infrastructure
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2019
    Area covered
    Indian Ocean
    Description

    Sea surface planktonic assemblages were sampled using environmental DNA at 1.5° latitudinal increments (from 39.5 to 11.5°S) following the 110°E meridian in the Eastern Indian Ocean to reveal factors structuring eukaryotic diversity. Bioinformatics and analyses were conducted in R version 4.1.1 (R Core Team 2021), and all bioinformatic scripts and matrices/databases are available on GitHub (https://github.com/wwfoodw/East_Indian_eDNA_Transect). Note that no clustering was used; instead, Dada2 was applied, which runs a model of PCR/sequencing error and then generates lineage-specific sequences (called ASVs or zero-radius OTUs). Raw FASTQ sequence data are publicly available in the Short Read Archive under BioProject PRJNA657626. This collection is published as Darwin Core Occurrence, so the event-level measurements would need to be replicated for every occurrence. Instead of data replication, the event-level eMoF data are made available separately at https://www.marine.csiro.au/data/services/obisau/emof_export.cfm?ipt_resource=in2019_v03_edna . Voyage details (metadata, projects, other datasets either online or as downloads, publications and reports, events, maps etc.) can be accessed at https://www.marine.csiro.au/data/trawler/survey_details.cfm?survey=IN2019_V03 . If this data has been used in any products, please acknowledge with the following: We acknowledge the use of the CSIRO Marine National Facility (https://ror.org/01mae9353) in undertaking this research.

  14. Cereal Small RNA Database

    • neuinfo.org
    • dknet.org
    • +1more
    Updated Jan 5, 2007
    Cite
    (2007). Cereal Small RNA Database [Dataset]. http://identifiers.org/RRID:SCR_007589
    Explore at:
    Dataset updated
    Jan 5, 2007
    Description

    CSRDB is a bioinformatics resource for cereal crops consisting of large-scale datasets of maize and rice small RNA sequences. The sequences were generated by 454 Life Science sequencing. The small RNA sequences have been mapped to the rice genome and the available maize genome sequence and are presented in two genome browser datasets using the Generic Genome Browser. Potential target sequences representing mature mRNA sequences have been predicted using the FASTH software from the Zuker lab, and access to the resulting small RNA target pair (SRTP) dataset has been made available through a MySQL-based relational database. Within the genome browser, the small RNAs have links to the SRTP database that will return a list of potential targets. The SRTP database may also be searched independently using both small RNA and target transcript queries. Data linking and integration is the main focus of this interface, and to this aim links are present in the SRTP results pages back to the browser and the SRTP database as well as to external sites.

  15. DataSheet_1_AppleMDO: A Multi-Dimensional Omics Database for Apple...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 1, 2023
    Cite
    Lingling Da; Yue Liu; Jiaotong Yang; Tian Tian; Jiajie She; Xuelian Ma; Wenying Xu; Zhen Su (2023). DataSheet_1_AppleMDO: A Multi-Dimensional Omics Database for Apple Co-Expression Networks and Chromatin States.pdf [Dataset]. http://doi.org/10.3389/fpls.2019.01333.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Lingling Da; Yue Liu; Jiaotong Yang; Tian Tian; Jiajie She; Xuelian Ma; Wenying Xu; Zhen Su
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As an economically important crop, apple is one of the most cultivated fruit trees in temperate regions worldwide. Recently, a large number of high-quality transcriptomic and epigenomic datasets for apple were made available to the public, which could be helpful in inferring gene regulatory relationships and thus predicting gene function at the genome level. Through integration of the available apple genomic, transcriptomic, and epigenomic datasets, we constructed co-expression networks, identified functional modules, and predicted chromatin states. A total of 112 RNA-seq datasets were integrated to construct a global network and a conditional network (tissue-preferential network). Furthermore, a total of 1,076 functional modules with closely related gene sets were identified to assess the modularity of biological networks and further subjected to functional enrichment analysis. The results showed that the function of many modules was related to development, secondary metabolism, hormone response, and transcriptional regulation. Transcriptional regulation is closely related to epigenetic marks on chromatin. A total of 20 epigenomic datasets, which included ChIP-seq, DNase-seq, and DNA methylation analysis datasets, were integrated and used to classify chromatin states. Based on the ChromHMM algorithm, the genome was divided into 620,122 fragments, which were classified into 24 states according to the combination of epigenetic marks and enriched-feature regions. Finally, through the collaborative analysis of different omics datasets, the online database AppleMDO (http://bioinformatics.cau.edu.cn/AppleMDO/) was established for cross-referencing and the exploration of possible novel functions of apple genes. In addition, gene annotation information and functional support toolkits were also provided. 
    Our database might be convenient for researchers to develop insights into the function of genes related to important agronomic traits and might serve as a reference for other fruit trees.

  16. Bioinformatics-UAS Kelompok 4

    • kaggle.com
    zip
    Updated Nov 19, 2025
    Cite
    Anthony TIF 2022 (2025). Bioinformatics-UAS Kelompok 4 [Dataset]. https://www.kaggle.com/datasets/anthonytif2022/bioinformatics
    Explore at:
    Available download formats: zip (2964027 bytes)
    Dataset updated
    Nov 19, 2025
    Authors
    Anthony TIF 2022
    Description

    Dataset

    This dataset was created by Anthony TIF 2022

    Contents

  17. Datasheet2_Integrative analysis of bioinformatics and machine learning to...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Mar 18, 2024
    + more versions
    Cite
    Luan, Yanmin; Zuo, Xiaoli; Ma, Chaoqun; Tu, Dingyuan; Xu, Qiang; Sun, Jie (2024). Datasheet2_Integrative analysis of bioinformatics and machine learning to identify cuprotosis-related biomarkers and immunological characteristics in heart failure.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001355558
    Explore at:
    Dataset updated
    Mar 18, 2024
    Authors
    Luan, Yanmin; Zuo, Xiaoli; Ma, Chaoqun; Tu, Dingyuan; Xu, Qiang; Sun, Jie
    Description

    Backgrounds: Cuprotosis is a newly discovered programmed cell death by modulating tricarboxylic acid cycle. Emerging evidence showed that cuprotosis-related genes (CRGs) are implicated in the occurrence and progression of multiple diseases. However, the mechanism of cuprotosis in heart failure (HF) has not been investigated yet. Methods: The HF microarray datasets GSE16499, GSE26887, GSE42955, GSE57338, GSE76701, and GSE79962 were downloaded from the Gene Expression Omnibus (GEO) database to identify differentially expressed CRGs between HF patients and nonfailing donors (NFDs). Four machine learning models were used to identify key CRGs features for HF diagnosis. The expression profiles of key CRGs were further validated in a merged GEO external validation dataset and human samples through quantitative reverse-transcription polymerase chain reaction (qRT-PCR). In addition, Gene Ontology (GO) function enrichment, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, and immune infiltration analysis were used to investigate potential biological functions of key CRGs. Results: We discovered nine differentially expressed CRGs in heart tissues from HF patients and NFDs. With the aid of four machine learning algorithms, we identified three indicators of cuprotosis (DLAT, SLC31A1, and DLST) in HF, which showed good diagnostic properties. In addition, their differential expression between HF patients and NFDs was confirmed through qRT-PCR. Moreover, the results of enrichment analyses and immune infiltration exhibited that these diagnostic markers of CRGs were strongly correlated to energy metabolism and immune activity. Conclusions: Our study discovered that cuprotosis was strongly related to the pathogenesis of HF, probably by regulating energy metabolism-associated and immune-associated signaling pathways.

  18. Data_Sheet_1_Validation of a Bioinformatics Workflow for Routine Analysis of...

    • datasetcatalog.nlm.nih.gov
    Updated Mar 6, 2019
    Cite
    Roosens, Nancy H. C.; Mattheus, Wesley; Fu, Qiang; Ceyssens, Pieter-Jan; Vanneste, Kevin; De Keersmaecker, Sigrid C. J.; Van Braekel, Julien; Bertrand, Sophie; Bogaerts, Bert; Winand, Raf (2019). Data_Sheet_1_Validation of a Bioinformatics Workflow for Routine Analysis of Whole-Genome Sequencing Data and Related Challenges for Pathogen Typing in a European National Reference Center: Neisseria meningitidis as a Proof-of-Concept.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000172206
    Explore at:
    Dataset updated
    Mar 6, 2019
    Authors
    Roosens, Nancy H. C.; Mattheus, Wesley; Fu, Qiang; Ceyssens, Pieter-Jan; Vanneste, Kevin; De Keersmaecker, Sigrid C. J.; Van Braekel, Julien; Bertrand, Sophie; Bogaerts, Bert; Winand, Raf
    Description

    Despite being a well-established research method, the use of whole-genome sequencing (WGS) for routine molecular typing and pathogen characterization remains a substantial challenge due to the required bioinformatics resources and/or expertise. Moreover, many national reference laboratories and centers, as well as other laboratories working under a quality system, require extensive validation to demonstrate that employed methods are “fit-for-purpose” and provide high-quality results. A harmonized framework with guidelines for the validation of WGS workflows does currently, however, not exist yet, despite several recent case studies highlighting the urgent need thereof. We present a validation strategy focusing specifically on the exhaustive characterization of the bioinformatics analysis of a WGS workflow designed to replace conventionally employed molecular typing methods for microbial isolates in a representative small-scale laboratory, using the pathogen Neisseria meningitidis as a proof-of-concept. We adapted several classically employed performance metrics specifically toward three different bioinformatics assays: resistance gene characterization (based on the ARG-ANNOT, ResFinder, CARD, and NDARO databases), several commonly employed typing schemas (including, among others, core genome multilocus sequence typing), and serogroup determination. We analyzed a core validation dataset of 67 well-characterized samples typed by means of classical genotypic and/or phenotypic methods that were sequenced in-house, allowing to evaluate repeatability, reproducibility, accuracy, precision, sensitivity, and specificity of the different bioinformatics assays. We also analyzed an extended validation dataset composed of publicly available WGS data for 64 samples by comparing results of the different bioinformatics assays against results obtained from commonly used bioinformatics tools. 
    We demonstrate high performance, with values for all performance metrics >87%, >97%, and >90% for the resistance gene characterization, sequence typing, and serogroup determination assays, respectively, for both validation datasets. Our WGS workflow has been made publicly available as a “push-button” pipeline for Illumina data at https://galaxy.sciensano.be to showcase its implementation for non-profit and/or academic usage. Our validation strategy can be adapted to other WGS workflows for other pathogens of interest and demonstrates the added value and feasibility of employing WGS with the aim of being integrated into routine use in an applied public health setting.

  19. Replication data for: "Gene Regulatory Network Inference Methodology for...

    • entrepot.recherche.data.gouv.fr
    bin, csv, text/tsv +2
    Updated Oct 23, 2024
    Cite
    Lise Pomiès; Céline Brouard; Harold Duruflé; Élise Maigné; Clément Carré; Louise Gody; Fulya Trösser; George Katsirelos; Brigitte Mangin; Nicolas Langlade; Simon de Givry (2024). Replication data for: "Gene Regulatory Network Inference Methodology for Genomic and Transcriptomic Data Acquired in Genetically Related Heterozygote Individuals", 100 simulated datasets of RNA gene expressions of sunflower hybrids [Dataset]. http://doi.org/10.15454/VRGWZ2
    Explore at:
    Available download formats: bin, csv, text/tsv, tsv, txt
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Lise Pomiès; Céline Brouard; Harold Duruflé; Élise Maigné; Clément Carré; Louise Gody; Fulya Trösser; George Katsirelos; Brigitte Mangin; Nicolas Langlade; Simon de Givry
    License

    https://spdx.org/licenses/etalab-2.0.html

    Description

    Replication data for: "Gene regulatory network inference methodology for genomic and transcriptomic data acquired in genetically related heterozygote individuals": 100 simulated datasets of RNA gene expressions of sunflower hybrids. This data set includes the 100 simulated datasets that were used in the paper "Gene regulatory network inference methodology for genomic and transcriptomic data acquired in genetically related heterozygote individuals", Bioinformatics, 2022, https://doi.org/10.1093/bioinformatics/btac445 . They are artificial expression datasets created with a modified version of the data simulator SysGenSIM from the same gene regulatory network: artificialDataSet_network.csv. The files that were used to generate the 100 expression datasets are also included (activation/repression sign in the networkSign directory and heterosis effect in the zMatrix directory). The "networks" directory contains the learned networks. The network inference method can be found here. For the description of the files, and the dimensions, see README.txt.

  20. Data from: PeTMbase: A database of plant endogenous target mimics (eTMs)

    • data.mendeley.com
    Updated Nov 23, 2016
    Cite
    Gökhan Karakülah (2016). PeTMbase: A database of plant endogenous target mimics (eTMs) [Dataset]. http://doi.org/10.17632/htgxryrcv2.1
    Explore at:
    Dataset updated
    Nov 23, 2016
    Authors
    Gökhan Karakülah
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MicroRNAs (miRNAs) are small endogenous RNA molecules that regulate target gene expression at the post-transcriptional level. In addition, miRNA activity can be controlled by a newly discovered regulatory mechanism called endogenous target mimicry (eTM). In target mimicry, eTMs bind to the corresponding miRNAs to block the binding of specific transcripts, leading to increased mRNA expression. Because miRNA-eTM-target-mRNA regulation modules are involved in a wide range of biological processes, an increasing need for a comprehensive eTM database arose. Except for miRSponge, which holds a limited number of Arabidopsis eTM data, no database or repository has yet been developed and released for plant eTMs. Here, we present an online plant eTM database, called PeTMbase (http://petmbase.org), with a highly efficient search tool. To establish the repository, a number of identified eTMs were obtained from high-throughput RNA-sequencing data of 11 plant species. Each transcriptome library was first mapped to the corresponding plant genome, then long non-coding RNA (lncRNA) transcripts were characterized. Furthermore, additional lncRNAs retrieved from GREENC and PNRD were incorporated into the lncRNA catalog. Then, utilizing the lncRNA and miRNA sources, a total of 2,728 eTMs were successfully predicted. Our regularly updated database, PeTMbase, provides high-quality information regarding miRNA:eTM modules and will aid functional genomics studies, particularly on miRNA regulatory networks.

Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated


Explore at:
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License

MIT License, https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

  • ID_Protein: Unique identifier for each protein.
  • Sequence: String of amino acids.
  • Molecular_Weight: Molecular weight calculated from the sequence.
  • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
  • Hydrophobicity: Average hydrophobicity calculated from the sequence.
  • Total_Charge: Sum of the charges of the amino acids in the sequence.
  • Polar_Proportion: Percentage of polar amino acids in the sequence.
  • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
  • Sequence_Length: Total number of amino acids in the sequence.
  • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
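
The column list above can be checked programmatically when loading the files. The following is a minimal sketch using only the Python standard library. It assumes the CSV files are comma-delimited and use exactly the column names listed above as their header; the sample row and its values are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the dataset description.
COLUMNS = [
    "ID_Protein", "Sequence", "Molecular_Weight", "Isoelectric_Point",
    "Hydrophobicity", "Total_Charge", "Polar_Proportion",
    "Nonpolar_Proportion", "Sequence_Length", "Class",
]
# The five functional classes named in the description.
CLASSES = {"Enzyme", "Transport", "Structural", "Receptor", "Other"}
# Every column between Sequence and Class is treated as numeric.
NUMERIC = COLUMNS[2:9]


def parse_rows(handle):
    """Yield one dict per protein, converting numeric columns to float."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        if row["Class"] not in CLASSES:
            raise ValueError(f"unexpected class: {row['Class']!r}")
        for name in NUMERIC:
            row[name] = float(row[name])
        yield row


# Hypothetical one-row sample; the values are made up for illustration.
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    "P00001,MKTAYIAKQR,1200.5,9.8,-0.5,3,50.0,50.0,10,Enzyme\n"
)
rows = list(parse_rows(io.StringIO(SAMPLE)))
```

To read the real files, the same function could be called on an opened file handle, for example `parse_rows(open("proteinas_train.csv", newline=""))`, provided the header matches the assumption above.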

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

  • UniProt: A comprehensive database of protein sequences and annotations.
  • Kyte-Doolittle Scale: Calculations of hydrophobicity.
  • Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
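
The three steps above can be sketched as follows. This is not the author's generation script: it is a standard-library stand-in that does not use Biopython, and it covers only a few of the listed properties. Uniform sampling over the 20 standard amino acids, the +1/-1 charge rule, and the `PROT_` identifier format are assumptions made for illustration. The hydropathy values are the published Kyte-Doolittle scale.

```python
import random

# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KD)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Assumption: crude side-chain charge, +1 for K/R and -1 for D/E.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def random_sequence(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain of 50 to 300 residues (uniform sampling assumed)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(seq):
    """Step 2 (partial): average Kyte-Doolittle hydropathy of the sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)


def total_charge(seq):
    """Step 2 (partial): sum of the assumed per-residue charges."""
    return sum(CHARGE.get(aa, 0) for aa in seq)


def make_protein(rng, index):
    """Build one synthetic record; the class is assigned at random (step 3)."""
    seq = random_sequence(rng)
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": seq,
        "Hydrophobicity": mean_hydrophobicity(seq),
        "Total_Charge": total_charge(seq),
        "Sequence_Length": len(seq),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
proteins = [make_protein(rng, i) for i in range(5)]
```

Because the class is drawn independently of the sequence in this sketch, no property computed from the sequence carries information about the label, which is consistent with the limitation stated for the functional classes.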

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).
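
The 16,000 / 4,000 division corresponds to an 80/20 split of the 20,000 proteins. A minimal sketch of such a split is shown below; whether the original split was shuffled, stratified by class, or seeded is not stated, so the shuffling and the seed here are assumptions.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and cut off the first share as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 20,000 row indices stand in for the 20,000 synthetic proteins.
train_ids, test_ids = train_test_split(range(20_000))
```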

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
