100+ datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."


    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets and tools, such as:

    • UniProt: a comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle scale: hydrophobicity calculations.
    • Biopython: a library for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training protein classification models.
    • Exploratory analysis of the physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
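    The property-calculation step (step 2) can be sketched in plain Python. Biopython's Bio.SeqUtils.ProtParam.ProteinAnalysis provides these quantities directly (molecular_weight(), isoelectric_point(), gravy()); the stdlib-only sketch below reproduces just the Kyte-Doolittle average hydrophobicity (GRAVY) by hand, on a made-up example sequence that is not taken from the dataset:

```python
# Stdlib-only sketch of the hydrophobicity part of step 2. Biopython's
# ProteinAnalysis.gravy() computes the same quantity.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

# Made-up example sequence (not from the dataset):
print(round(gravy("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"), 3))
```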

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
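    A minimal sketch of loading the split with pandas and fitting a baseline classifier on the numeric columns listed above. The classifier choice (a decision tree) is illustrative, not the author's method, and tiny randomly generated stand-in CSVs are written first so the snippet is self-contained; with the real Kaggle files, skip the fake_split() calls:

```python
# Load the train/test split and fit a baseline classifier. The stand-in
# CSVs below contain random values, NOT the real data.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["Molecular_Weight", "Isoelectric_Point", "Hydrophobicity",
            "Total_Charge", "Polar_Proportion", "Nonpolar_Proportion",
            "Sequence_Length"]
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

rng = np.random.default_rng(0)

def fake_split(n_rows, path):
    # Write a stand-in CSV with the documented column layout.
    df = pd.DataFrame(rng.normal(size=(n_rows, len(FEATURES))), columns=FEATURES)
    df["Class"] = rng.choice(CLASSES, size=n_rows)
    df.to_csv(path, index=False)

fake_split(160, "proteinas_train.csv")
fake_split(40, "proteinas_test.csv")

train = pd.read_csv("proteinas_train.csv")
test = pd.read_csv("proteinas_test.csv")

clf = DecisionTreeClassifier(random_state=0)
clf.fit(train[FEATURES], train["Class"])
accuracy = (clf.predict(test[FEATURES]) == test["Class"]).mean()
print(f"baseline accuracy: {accuracy:.2f}")
```

    On the random stand-in data the accuracy is near chance; on the real dataset the classes are randomly assigned as well (see "How This Dataset Was Created"), so this mainly exercises the pipeline.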

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Data from: Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

    • zenodo.org
    • search.dataone.org
    zip
    Updated Jun 5, 2022
    Cite
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart (2022). Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection [Dataset]. http://doi.org/10.5061/dryad.0zpc866z8
    Available download formats: zip
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Lucie Tamisier; Annelies Haegeman; Yoika Foucart; Nicolas Fouillien; Maher Al Rwahnih; Nihal Buzkan; Thierry Candresse; Michela Chiumenti; Kris De Jonghe; Marie Lefebvre; Paolo Margaria; Jean Sébastien Reynard; Kristian Stevens; Denis Kutnjak; Sébastien Massart
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows huge numbers of DNA and RNA fragments to be sequenced at very low cost. In medicine, HTS tests for disease diagnostics have already been brought into routine practice; however, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of the data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for validation of bioinformatics pipelines and for standardization purposes.

    Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a "real" HTS dataset spiked with artificial viral reads, allowing researchers to tune their pipelines and parameters to approximate, as closely as possible, the actual viral composition of the semi-artificial datasets. Each semi-artificial dataset tests one or more limitations that could prevent virus detection or correct virus identification from HTS data (i.e. low viral concentration, new viral species, non-complete genome).

    Eight artificial datasets only composed of viral reads (no background data) have also been created (Datasets 11 to 18). Each dataset consists of a mix of several isolates from the same viral species showing different frequencies. The viral species were selected to be as divergent as possible. These datasets can be used to test haplotype reconstruction software, the goal being to reconstruct all the isolates present in a dataset.

    A GitLab repository (https://gitlab.com/ilvo/VIROMOCKchallenge) is available and provides a complete description of the composition of each dataset, the methods used to create them and their goals.

  3. Bioinformatic databases survey

    • zenodo.org
    csv
    Updated Aug 17, 2024
    Cite
    Alise Ponsero; Bonnie Hurwitz; Kiran Smelser; Karen Valencia; Lucas Jimenez Miranda; Abby McDermott (2024). Bioinformatic databases survey [Dataset]. http://doi.org/10.5281/zenodo.12790448
    Available download formats: csv
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Alise Ponsero; Bonnie Hurwitz; Kiran Smelser; Karen Valencia; Lucas Jimenez Miranda; Abby McDermott
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bioinformatic databases survey

    The dataset surveys bioinformatic databases published in the NAR database issue from 1995 to 2022. It evaluates the current number of citations and the availability of each resource.

    Data content

    The dataset is composed of two tables:

    A. Databases table: contains the information for each database published in the NAR database issue.

    • db_id: Database ID in the dataset
    • resource_name: Name(s) of the database
    • current_access: Latest known web address of the database
    • is_a_pun: Whether the database name is a play on words
    • available_2022: Whether the database was accessible online during the 2022 survey
    • last_accessible_year: If not accessible, the latest point in time at which the database was found online (using Internet Archive snapshots)
    • unavailable_message: If not accessible, the message/error shown when trying to access the resource
    • year_first_publication: Year of first publication of the database
    • year_last_publication: Year of latest publication of the database (including database update publications)
    • total_citations_2022: Cumulative number of citations for all articles of the database
    • nb_authors_max: Maximum number of authors associated with any article published for that database
    • nb_articles_2022: Number of articles published for that database in 2022

    B. Articles table: contains the information collected for the NAR articles.

    • collector: Person who contributed this database to the dataset
    • article_global_id: DOI of the surveyed article
    • db_id: Database ID of the resource described in the article
    • article_id: Unique article ID
    • article_year: Article publication year
    • Authors: List of authors of the article, separated by ";"
    • Author.ID: List of ORCIDs of the article's authors, separated by ";"
    • Title: Title of the article
    • Source.title: Journal name
    • Volume: Volume number
    • Issue: Issue number
    • Funding.Details: Funding information of the article
    • Funding.Text: Funding text provided by the authors
    • PubMed.ID: PubMed ID of the article
    • citations_2016: Number of citations of the article in 2016 (if published)
    • citations_2022: Number of citations of the article in 2022
    • nb_authors: Number of authors of the article
    • Index.Keywords: Keywords associated with the publication
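    The two tables share the db_id key, so they can be joined for analysis. A hypothetical sketch with pandas, using in-memory stand-in rows with invented values (the actual CSV file names in the Zenodo record may differ):

```python
# Join the articles table to the databases table on db_id and aggregate
# per-database citations. All rows below are invented stand-ins that use
# the documented column names.
import pandas as pd
from io import StringIO

databases = pd.read_csv(StringIO(
    "db_id,resource_name,available_2022,total_citations_2022\n"
    "1,ExampleDB,True,120\n"
    "2,OtherDB,False,15\n"))

articles = pd.read_csv(StringIO(
    "article_id,db_id,article_year,citations_2022\n"
    "a1,1,2001,100\n"
    "a2,1,2010,20\n"
    "a3,2,2005,15\n"))

merged = articles.merge(databases, on="db_id", how="left")
per_db = merged.groupby("db_id")["citations_2022"].sum()
print(per_db)
```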

    Data sources

    Note that the presented dataset leverages and expands on the dataset gathered and published in Imker, H.J., 2020. Who Bears the Burden of Long-Lived Molecular Biology Databases?. Data Science Journal, 19(1), p.8. The original dataset collected by Dr. Imker is available at: https://doi.org/10.13012/B2IDB-4311325_V1

    The dataset was collected and is maintained by undergraduate students of a CURE class (Course-based Undergraduate Research Experience) held at the University of Arizona. All students of the class participated in the collection, updating, and curation of the dataset, which is available as a database and a web portal at https://hurwitzlab.shinyapps.io/DS_Heroes/. Students could elect whether or not to be added as authors of this Zenodo repository.

    The CURE class BAT102 "Data Science Heroes: An undergraduate research experience in Open Data Science Practices" gives the students an opportunity to learn about open science and investigate open data practices in bioinformatics through a survey of the databases published in the NAR database issue.

  4. Dataset from: An Evaluation of Large Language Models in Bioinformatics Research

    • zenodo.org
    zip
    Updated Jul 25, 2025
    Cite
    Hengchuang Yin; Lun Hu (2025). Dataset from: An Evaluation of Large Language Models in Bioinformatics Research [Dataset]. http://doi.org/10.5281/zenodo.16419266
    Available download formats: zip
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Hengchuang Yin; Lun Hu
    Description

    This repository contains the data and code to reproduce the results of our paper: An Evaluation of Large Language Models in Bioinformatics Research

    Authors: Hengchuang Yin; Lun Hu

    Abstract: Large language models, such as the GPT series, have revolutionized natural language processing by demonstrating strong capabilities in text generation and reasoning. However, their potential in the field of bioinformatics, characterized by complex biological data and specialized knowledge, has not been fully evaluated. In this study, we systematically assess the performance of multiple advanced and widely used LLMs on six diverse bioinformatics tasks: drug-drug interaction prediction, antimicrobial and anticancer peptide identification, molecular optimization, gene and protein named entity recognition, single-cell type annotation, and bioinformatics problem solving.
    Our experimental results demonstrate that, with appropriate prompt design and limited task-specific fine-tuning, general-purpose LLMs can achieve competitive or even superior performance compared to traditional models that require extensive computational resources and technical design across various tasks. Our analysis further uncovers the current limitations of LLMs in handling structurally complex and knowledge-intensive bioinformatics problems. Overall, this study demonstrates the broad prospects of LLMs in bioinformatics while emphasizing their limitations, providing valuable insights for future research at the intersection of LLMs and bioinformatics.

    Section_A_ddi:

    ddinter_positive_samples.csv: Positive drug–drug interaction (DDI) pairs curated from the DDInter database.

    ddinter_negative_samples.csv: Negative drug–drug interaction (DDI) pairs (no known interactions), used for supervised classification.

    drug_description_embeddings_all-mpnet-base-v2.npy: Drug description embeddings generated using the all-mpnet-base-v2 model.

    drug_description_embeddings_bge-large-en-v1.5.npy: Drug description embeddings generated using the bge-large-en-v1.5 model.

    drug_description_embeddings_e5-small-v2.npy: Drug description embeddings generated using the e5-small-v2 model.

    drug_description_embeddings_gtr-t5-large.npy: Drug description embeddings generated using the gtr-t5-large model.

    drug_description_embeddings_text_embedding_3_large.npy: Drug description embeddings generated using OpenAI's text-embedding-3-large model.

    drug_description_embeddings_text_embedding_3_small.npy: Drug description embeddings generated using OpenAI's text-embedding-3-small model.

    drug_description_embeddings_text_embedding_ada_002.npy: Drug description embeddings generated using OpenAI's text-embedding-ada-002 model.
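    Each .npy file can be inspected with NumPy. A sketch (a zero-filled stand-in array is written first so the snippet runs without the Zenodo download; the 768-column width matches the output dimension of all-mpnet-base-v2, but treat the exact shape of the real file as an assumption):

```python
# Inspect one embedding matrix with NumPy. With the real archive,
# skip the np.save and just load the file name from the list above.
import numpy as np

path = "drug_description_embeddings_all-mpnet-base-v2.npy"
np.save(path, np.zeros((4, 768), dtype=np.float32))  # stand-in, not real data

emb = np.load(path)
print(emb.shape)  # rows: drugs, columns: embedding dimensions
```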

    We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.

    Contact

    If you have any questions, please don't hesitate to contact me: yinhengchuang@ms.xjb.ac.cn or hulun@ms.xjb.ac.cn

  5. Data from: A large-scale analysis of bioinformatics code on GitHub

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 31, 2018
    Cite
    Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas (2018). A large-scale analysis of bioinformatics code on GitHub [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000639408
    Dataset updated
    Oct 31, 2018
    Authors
    Carlson, Nichole E.; Harnke, Benjamin; Russell, Pamela H.; Johnson, Rachel L.; Ananthan, Shreyas
    Description

    In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.

  6. Extracted Schemas from the Life Sciences Linked Open Data Cloud

    • figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Maulik Kamdar (2023). Extracted Schemas from the Life Sciences Linked Open Data Cloud [Dataset]. http://doi.org/10.6084/m9.figshare.12402425.v2
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Maulik Kamdar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is related to the manuscript "An empirical meta-analysis of the life sciences linked open data on the web" published at Nature Scientific Data. If you use the dataset, please cite the manuscript as follows:

    Kamdar, M.R., Musen, M.A. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 8, 24 (2021). https://doi.org/10.1038/s41597-021-00797-y

    We have extracted schemas from more than 80 publicly available biomedical linked data graphs in the Life Sciences Linked Open Data (LSLOD) cloud into an LSLOD schema graph and conducted an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. The dataset published here contains the following files:

    • The set of linked data graphs from the LSLOD cloud from which schemas are extracted.
    • Refined sets of extracted classes, object properties, data properties, and datatypes, shared across the linked data graphs on the LSLOD cloud. Where a schema element is reused from a Linked Open Vocabulary or an ontology, this is explicitly indicated.
    • The LSLOD schema graph, which contains all the above extracted schema elements interlinked with each other based on the underlying content. Sample instances and sample assertions are also provided, along with broad-level characteristics of the modeled content. The LSLOD schema graph is saved as a JSON pickle file. To read the JSON object in this pickle file, use the Python command as follows:

    with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
        x = pickle.load(infile, encoding='iso-8859-1')

    Check the referenced link for more details on this research, raw data files, and code references.
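    The loading command from the description, wrapped into a self-contained sketch (a stand-in pickle holding a small JSON-like dict is written first so the snippet runs without the Figshare download; the keys are invented for illustration):

```python
import pickle

# Stand-in object; the real file holds the extracted LSLOD schema graph.
with open('LSLOD-Schema-Graph.json.pickle', 'wb') as outfile:
    pickle.dump({"classes": [], "object_properties": []}, outfile)

# The load step from the description (the encoding argument matters only
# for pickles written under Python 2):
with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
    x = pickle.load(infile, encoding='iso-8859-1')

print(sorted(x.keys()))
```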

  7. Transcriptomics in yeast

    • kaggle.com
    zip
    Updated Jan 24, 2017
    Cite
    CostalAether (2017). Transcriptomics in yeast [Dataset]. https://www.kaggle.com/costalaether/yeast-transcriptomics
    Available download formats: zip (4901525 bytes)
    Dataset updated
    Jan 24, 2017
    Authors
    CostalAether
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Disclaimer

    This is a dataset of mine that I thought might be enjoyable to the community. It concerns next-generation sequencing (NGS) and transcriptomics. I used several raw datasets that are public, but the processing to get to this dataset is extensive. This is my first contribution to Kaggle, so be nice, and let me know how I can improve the experience. NGS machines combined are the biggest data producers worldwide, so why not add some (more?) to Kaggle.

    A look into Yeast transcriptomics

    Background

    Yeasts (in this case Saccharomyces cerevisiae) are used in the production of beer, wine, and bread, and in a whole lot of biotech applications such as creating complex pharmaceuticals. They are living eukaryotic organisms (meaning quite complex). All living organisms store information in their DNA, but action within a cell is carried out by specific proteins. The path from DNA to protein (from data to action) is simple: a specific region on the DNA gets transcribed to mRNA, which gets translated to proteins. A common assumption is that the translation step is linear: more mRNA means more protein. Cells actively regulate the amount of protein through the amount of mRNA they create, and the expression of each gene depends on the condition the cell is in (starving, stressed, etc.). Modern methods in biology show us all mRNA that is currently inside a cell. Assuming the linearity of the process, a cell gets more protein the more of the specific mRNA is available, making mRNA an excellent marker for what is actually happening inside a cell. It is important to consider that mRNA is fragile and is actively replenished only when it is needed; both mRNA and proteins are expensive for a cell to produce.

    Yeasts are good model organisms for this, since they only have about 6,000 genes. They are also single cells, which makes them more homogeneous, and they contain few advanced features (splice junctions, etc.).

    ( all of this is heavily simplified, let me know if I should go into more details )

    The data

    files

    The following files are provided:

    • SC_expression.csv: expression values for each gene over the available conditions.
    • labels_CC.csv: labels for the individual genes, their status, and, where known, intracellular localization (see below).

    Maybe this would be nice as a little competition; I'll see how this one is going before I upload the other label files. Please provide some feedback on the presentation, and whatever else you would want me to share.

    background

    I used 92 samples from various openly available raw datasets and ran them through a modern RNA-seq pipeline, spanning a range of different conditions (I hid the raw names). The conditions covered stress conditions, temperature, and heavy metals, as well as growth media changes and the deletion of specific genes. Originally I had 150 sets; 92 are of good enough quality. Evaluation was done at the gene level. Each gene got its own row; samples are columns (some are in replicates over several columns). Expression levels were normalized by TPM (transcripts per million), a default normalization procedure. Raw counts would have been integers; normalized, they are floats.
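    The TPM normalization mentioned above divides each gene's raw count by the gene length in kilobases, then rescales each sample so its values sum to one million. A sketch with invented numbers (not from this dataset):

```python
# TPM normalization: reads-per-kilobase first, then per-sample scaling.
import numpy as np

counts = np.array([[100, 200], [300, 400], [50, 10]], dtype=float)  # genes x samples
lengths_kb = np.array([1.0, 2.0, 0.5])                              # gene lengths in kb

rpk = counts / lengths_kb[:, None]          # reads per kilobase
tpm = rpk / rpk.sum(axis=0) * 1_000_000     # scale each sample (column)

print(tpm.sum(axis=0))  # each column sums to 1e6 by construction
```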

    Analysis and labels

    Genes

    The function of individual genes is a matter of dispute. Clearly, living cells are complex, and the inner machinations of cells are not visible. Gene functionality is commonly inferred indirectly by removing a gene and testing the cell's behavior. This is time-consuming and not very precise. As you can see in the dataset, there is still much to be done to fully understand even single-cell yeasts.

    The provided dataset allows for a different approach to functional classification of genes. The label files contained in the set map each gene to a specific label. The classification is based on the official Gene Ontology associations classification; I simplified the nomenclature. Gene functionality is usually given in a hierarchical structure [inside cell --> cytoplasm --> associated to complex A ...]. I'm only keeping high-level associations, and using readable terms instead of GO terms. I'll extend if people are interested.

    Labels

    CC labels concern Cellular Component: where the gene product is located within a cell, going into details of found associations. The label 'cellular_component' itself should be read as synonymous with 'unknown location'. CC is the easiest label to attach to a gene and the easiest to study; still, many genes are missing labels.

    MF labels concern Molecular Function: what the gene product does. [upcoming]
    BP labels concern Biological Processes: what the gene is involved in. [upcoming]

    The core interest here is whether it is possible to improve the genes classification by modeling the data. A common assu...

  8. cell differentiation tree(dataset)

    • search.datacite.org
    • figshare.com
    Updated Aug 7, 2019
    Cite
    Nazifa Ahmed Moumi (2019). cell differentiation tree(dataset) [Dataset]. http://doi.org/10.6084/m9.figshare.9337469.v3
    Dataset updated
    Aug 7, 2019
    Dataset provided by
    DataCite
    Figshare: http://figshare.com/
    Authors
    Nazifa Ahmed Moumi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each replicate (individually or combined) in the three datasets H3K4me3, H3K27me3, and H3K36me3, and in the dataset H3K27ac, has a fixed number of cell types. H3K4me3 has two replicates: 1 and 2. H3K27me3 holds replicate 1 and replicates 1 and 2 together. H3K36me3 has only replicate 1. For combined analysis of H3K4me3 and H3K27me3 there is a folder named H3K4me3-H3K27me3(combined). Each dataset folder has four subfolders, named IQA, MLQA, ML, and Overlap representation. The ML, MLQA, and IQA folders hold the results from these three cell-type tree generation methods; all three contain cell-type trees in Newick format. The estimated quartet files generated for the MLQA and IQA methods are given in both the MLQA and IQA folders. Finally, the overlap representation data for the cell types are in the Overlap representation folder. That folder has a text file named Overlap_datarepresentaion, in which the two numbers in the first row give the number of cell types and the data length; after that, each row, identified by t1, t2, etc., carries the overlap data. The mapping from t1, t2, etc. to the original cell types is provided in the file_sequence text file.

  9. Molecular Biology Databases Published in Nucleic Acids Research between 1991-2016

    • databank.illinois.edu
    Updated Feb 1, 2024
    Cite
    Heidi Imker (2024). Molecular Biology Databases Published in Nucleic Acids Research between 1991-2016 [Dataset]. http://doi.org/10.13012/B2IDB-4311325_V1
    Dataset updated
    Feb 1, 2024
    Authors
    Heidi Imker
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.

  10. Cell_Gene_Expression_Metadata

    • kaggle.com
    zip
    Updated Sep 24, 2025
    Cite
    Kazi Aishikuzzaman (2025). Cell_Gene_Expression_Metadata [Dataset]. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata
    Available download formats: zip (845887409 bytes)
    Dataset updated
    Sep 24, 2025
    Authors
    Kazi Aishikuzzaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This dataset contains comprehensive metadata from single-cell gene expression studies, providing researchers with structured information about cellular phenotypes, experimental conditions, and sample characteristics. The data is particularly valuable for bioinformatics research, machine learning applications in genomics, and comparative studies across different cell types and conditions.

    Dataset Description: The dataset comprises metadata associated with single-cell RNA sequencing (scRNA-seq) experiments, including:

    • Cell Type Information: classification of different cell types and subtypes
    • Experimental Metadata: details about experimental conditions, protocols, and methodologies
    • Sample Characteristics: information about biological samples, including tissue origin, developmental stages, and treatment conditions
    • Quality Metrics: data quality indicators and filtering parameters
    • Annotation Details: standardized cell type annotations and biological classifications

    Data Source and Licensing: This dataset is derived from publicly available single-cell gene expression data, potentially sourced from:

    • CELLxGENE Data Portal (https://cellxgene.cziscience.com/)
    • Gene Expression Omnibus (GEO)
    • European Bioinformatics Institute (EBI)
    • Other public genomics repositories

    License: Creative Commons CC BY 4.0
    ✅ Commercial use allowed ✅ Modification allowed ✅ Distribution allowed ✅ Private use allowed ❗ Attribution required

    Research Applications:

    • Cell Type Discovery: identify novel cell types and subtypes
    • Comparative Genomics: study cellular differences across conditions, tissues, or species
    • Disease Research: investigate cellular changes in disease states
    • Developmental Biology: analyze cellular differentiation and development patterns

    Machine Learning Applications:

    • Classification Tasks: predict cell types from gene expression data
    • Clustering Analysis: discover cellular subpopulations and states
    • Dimensionality Reduction: apply PCA, t-SNE, UMAP for visualization
    • Biomarker Discovery: identify genes characteristic of specific cell types

    Educational Use
    • Teaching bioinformatics and computational biology concepts
    • Demonstrating single-cell analysis workflows
    • Training in data preprocessing and quality control

    Data Quality and Preprocessing
    • Quality Control: Metadata has been curated and standardized
    • Missing Values: [Specify how missing values are handled]
    • Standardization: Cell type annotations follow established ontologies (e.g., Cell Ontology)
    • Validation: Data has been cross-referenced with original publications

    Usage Guidelines: Getting Started
    1. Load the metadata files using pandas or your preferred data analysis tool.
    2. Explore the cell type distributions and experimental conditions.
    3. Filter data based on quality metrics as needed.
    4. Join with corresponding gene expression data for comprehensive analysis.
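The getting-started steps can be sketched with pandas. The file contents and the column names (`cell_id`, `cell_type`, `n_genes`) below are hypothetical stand-ins for the real schema, and the QC threshold is arbitrary:

```python
import io
import pandas as pd

# In-memory stand-ins for metadata_summary.csv and quality_metrics.csv.
summary_csv = io.StringIO(
    "cell_id,cell_type,tissue\n"
    "c1,T cell,blood\n"
    "c2,B cell,blood\n"
    "c3,NA,liver\n"
)
qc_csv = io.StringIO(
    "cell_id,n_genes\n"
    "c1,2100\n"
    "c2,450\n"
    "c3,1800\n"
)

# 1) Load (missing values are encoded as 'NA' per the technical specs).
meta = pd.read_csv(summary_csv, na_values=["NA"])
qc = pd.read_csv(qc_csv)

# 2) Explore the cell type distribution.
print(meta["cell_type"].value_counts(dropna=False))

# 3-4) Join metadata with quality metrics, then filter on a QC threshold.
merged = meta.merge(qc, on="cell_id")
filtered = merged[merged["n_genes"] >= 1000]
print(len(filtered))   # 2 cells pass the filter
```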

    Best Practices
    • Always cite original data sources and publications.
    • Consider batch effects when combining data from different experiments.
    • Validate findings with independent datasets when possible.
    • Follow established bioinformatics workflows for single-cell analysis.

    Citation and Acknowledgments: If you use this dataset in your research, please cite it as: Kazi Aishikuzzaman (2024). Cell Gene Expression Metadata. Kaggle. https://www.kaggle.com/datasets/kaziaishikuzzaman/cell-gene-expression-metadata

    File Structure
    dataset/
    ├── metadata_summary.csv        # Main metadata file
    ├── cell_type_annotations.csv   # Detailed cell type information
    ├── experimental_conditions.csv # Experiment-specific metadata
    ├── quality_metrics.csv         # Data quality indicators
    └── README.txt                  # Detailed file descriptions

    Technical Specifications
    • File Encoding: UTF-8
    • Separator: Comma-separated values (CSV)
    • Missing Values: Represented as 'NA' or empty cells
    • Data Types: Mixed (categorical, numerical, text)

    Contact and Support: For questions about this dataset:
    • Kaggle Profile: @kaziaishikuzzaman
    • Dataset Issues: Use Kaggle's discussion section
    • Collaboration: Open to research collaborations and improvements

    Version History
    • v1.0: Initial release with comprehensive metadata collection
    • [Future versions]: Updates and additional annotations as available

    Related Datasets: Consider exploring these complementary datasets:
    • Single-cell gene expression data (companion to this metadata)
    • Cell atlas datasets from major consortiums
    • Disease-specific single-cell studies
    • Multi-omics datasets with matching cell types

    Keywords: single-cell, RNA-seq, genomics, cell types, metadata, bioinformatics, machine learning, computational biology Category: Biology > Genomics

  11. Dataset for practice session 1 in bioinformatics

    • figshare.com
    txt
    Updated Jul 17, 2016
    Cite
    Elena Sugis (2016). Dataset for practice session 1 in bioinformatics [Dataset]. http://doi.org/10.6084/m9.figshare.3490211.v3
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jul 17, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elena Sugis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for the practice session on data preprocessing and unsupervised learning in the Introduction to Bioinformatics course.

  12. Data from: Data reuse and the open data citation advantage

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +3more
    bin, csv, txt
    Updated May 28, 2022
    Cite
    Heather A. Piwowar; Todd J. Vision (2022). Data from: Data reuse and the open data citation advantage [Dataset]. http://doi.org/10.5061/dryad.781pv
    Explore at:
    bin, csv, txtAvailable download formats
    Dataset updated
    May 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Heather A. Piwowar; Todd J. Vision
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
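To illustrate the kind of log-scale regression behind such a "citation benefit" estimate, here is a toy simulation with invented covariates and a built-in 9% effect; it is not the study's actual model, data, or covariate set:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated covariates (stand-ins for impact factor, author count, etc.).
impact = rng.normal(2.0, 0.5, n)
authors = rng.poisson(5, n).astype(float)
open_data = rng.integers(0, 2, n).astype(float)

# Simulate log citation counts with a true 9% open-data benefit.
log_cit = (0.5 * impact + 0.05 * authors
           + np.log(1.09) * open_data + rng.normal(0.0, 0.3, n))

# OLS on the log scale: the open_data coefficient estimates log(1 + benefit).
X = np.column_stack([np.ones(n), impact, authors, open_data])
beta, *_ = np.linalg.lstsq(X, log_cit, rcond=None)
benefit = np.exp(beta[3]) - 1.0
print(f"estimated citation benefit: {benefit:.1%}")  # close to the simulated 9%
```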

  13. [Dataset] Data for the course "Population Genomics" at Aarhus University

    • zenodo.org
    application/gzip, bin
    Updated Jan 8, 2025
    + more versions
    Cite
    Samuele Soraggi; Kasper Munch (2025). [Dataset] Data for the course "Population Genomics" at Aarhus University [Dataset]. http://doi.org/10.5281/zenodo.7670839
    Explore at:
    application/gzip, binAvailable download formats
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Samuele Soraggi; Kasper Munch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets, conda environments, and software for the course "Population Genomics" of Prof. Kasper Munch. This course material is maintained by the Health Data Science Sandbox. This webpage shows the latest version of the course material.

    1. Data.tar.gz Contains the datasets and executable files for some of the software.
      You can unpack it with
      tar -zxf Data.tar.gz -C ./
      This creates a folder called Data with the uncompressed material inside.
    2. Course_Env.packed.tar.gz Contains the conda environment used for the course. This needs to be unpacked to adjust all the prefixes (Note this environment is created on Ubuntu 22.10). You do this in the command line by
      1. Create the folder Course_Env: mkdir Course_Env
      2. Untar the file: tar -zxf Course_Env.packed.tar.gz -C Course_Env
      3. Activate the environment: conda activate ./Course_Env
      4. Run the unpacking script (it can take quite some time to finish): conda-unpack
    3. Course_Env.unpacked.tar.gz The same environment as above, but it will work only if untarred into the folder /usr/Material - so use the version above if you are using another folder. This file is mostly intended for running the course in our own cloud environment.
    4. environment_with_args.yml The file needed to generate the conda environment. Create and activate the environment with the following commands:
      1. conda env create -f environment_with_args.yml -p ./Course_Env
      2. conda activate ./Course_Env

    The data is connected to the following repository: https://github.com/hds-sandbox/Popgen_course_aarhus. The original course material from Prof Kasper Munch is at https://github.com/kaspermunch/PopulationGenomicsCourse.

    Description

    After the course, the participants will have detailed knowledge of the methods and applications required to perform a typical population genomic study.

    At the end of the course, the participants must be able to:

    • Identify an experimental platform relevant to a population genomic analysis.
    • Apply commonly used population genomic methods.
    • Explain the theory behind common population genomic methods.
    • Reflect on strengths and limitations of population genomic methods.
    • Interpret and analyze results of population genomic inference.
    • Formulate population genetics hypotheses based on data.

    The course introduces key concepts in population genomics, from the generation of population genetic data sets to the most common population genetic analyses and association studies. The first part of the course focuses on the generation of population genetic data sets. The second part introduces the most common population genetic analyses and their theoretical background; topics here include analysis of demography, population structure, recombination, and selection. The last part of the course focuses on applications of population genetic data sets for association studies in relation to human health.

    Curriculum

    The curriculum for each week is listed below. "Coop" refers to a set of lecture notes by Graham Coop that we will use throughout the course.

    Course plan

    1. Course intro and overview:
    2. Drift and the coalescent:
    3. Recombination:
    4. Population structure and incomplete lineage sorting:
    5. Hidden Markov models:
    6. Ancestral recombination graphs:
    7. Past population demography:
    8. Direct and linked selection:
    9. Admixture:
    10. Genome-wide association study (GWAS):
    11. Heritability:
      • Lecture: Coop Lecture notes Sec. 2.2 (p23-36) + Chap. 7 (p119-142)
      • Exercise: Association testing
    12. Evolution and disease:
      • Lecture: Coop Lecture notes Sec. 11.0.1 (p217-221)
      • Exercise: Estimating heritability
  14. Data from: Advancing computational biology and bioinformatics research...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 27, 2019
    Cite
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
    Explore at:
    Dataset updated
    Sep 27, 2019
    Authors
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
    Description

    Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.

  15. Table1_Construction of a potentially functional lncRNA-miRNA-mRNA network in...

    • frontiersin.figshare.com
    bin
    Updated Jun 20, 2023
    Cite
    Li-ming Zheng; Jun-qiu Ye; Heng-fei Li; Quan Liu (2023). Table1_Construction of a potentially functional lncRNA-miRNA-mRNA network in sepsis by bioinformatics analysis.docx [Dataset]. http://doi.org/10.3389/fgene.2022.1031589.s009
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Li-ming Zheng; Jun-qiu Ye; Heng-fei Li; Quan Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Sepsis is a common disease in internal medicine, with a high incidence and dangerous condition. Due to the limited understanding of its pathogenesis, the prognosis is poor. The goal of this project is to screen potential biomarkers for the diagnosis of sepsis and to identify competitive endogenous RNA (ceRNA) networks associated with sepsis.Methods: The expression profiles of long non-coding RNAs (lncRNAs), microRNAs (miRNAs) and messenger RNAs (mRNAs) were derived from the Gene Expression Omnibus (GEO) dataset. The differentially expressed lncRNAs (DElncRNAs), miRNAs (DEmiRNAs) and mRNAs (DEmRNAs) were screened by bioinformatics analysis. DEmRNAs were analyzed by protein-protein interaction (PPI) network analysis, transcription factor enrichment analysis, Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis and Gene Set Enrichment Analysis (GSEA). After the prediction of the relevant database, the competitive ceRNA network is built in Cytoscape. The gene-drug interaction was predicted by DGIgb. Finally, quantitative real-time polymerase chain reaction (qRT-PCR) was used to confirm five lncRNAs from the ceRNA network.Results: Through Venn diagram analysis, we found that 57 DElncRNAs, 6 DEmiRNAs and 317 DEmRNAs expressed abnormally in patients with sepsis. GO analysis and KEGG pathway analysis showed that 789 GO terms and 36 KEGG pathways were enriched. Through intersection analysis and data mining, 5 key KEGG pathways and related core genes were revealed by GSEA. The PPI network consists of 247 nodes and 1,163 edges, and 50 hub genes are screened by the MCODE plug-in. In addition, there are 5 DElncRNAs, 6 DEmiRNAs and 28 DEmRNAs in the ceRNA network. Drug action analysis showed that 7 genes were predicted to be molecular targets of drugs. 
Five lncRNAs in the ceRNA network were verified by qRT-PCR, and the results showed that the relative expression of the five lncRNAs was significantly different between sepsis patients and healthy control subjects. Conclusion: A sepsis-specific ceRNA network has been effectively created, which is helpful to understand the interactions between lncRNAs, miRNAs and mRNAs. We discovered prospective sepsis peripheral blood indicators and proposed potential treatment medicines, providing new insights into the progression and development of sepsis.

  16. 🧫 Promoter or not? - Bioinformatics 🗃️ Dataset

    • kaggle.com
    zip
    Updated Mar 31, 2024
    Cite
    Samira Shemirani (2024). 🧫 Promoter or not? - Bioinformatics 🗃️ Dataset [Dataset]. https://www.kaggle.com/datasets/samira1992/promoter-or-not-bioinformatics-dataset
    Explore at:
    zip(4992691 bytes)Available download formats
    Dataset updated
    Mar 31, 2024
    Authors
    Samira Shemirani
    Description

    The promoter region is located near the transcription start site, where it regulates transcription initiation of the gene by controlling the binding of RNA polymerase. Thus, recognition of the promoter region is an important area of interest in the field of bioinformatics. Over the past years, many new promoter prediction programs (PPPs) have emerged. PPPs aim to identify promoter regions in a genome using computational methods. Promoter prediction is a supervised learning problem whose feature extraction involves three main steps: (1) CpG islands, (2) structural features, and (3) content features.
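A minimal sketch of the first feature-extraction step: computing the two classic CpG-island quantities (GC content and the CpG observed/expected ratio) for a DNA sequence. The helper function and toy sequence are ours, not part of the dataset:

```python
def cpg_features(seq: str) -> dict:
    """GC content and CpG observed/expected ratio, the two quantities
    classically used to flag CpG islands."""
    seq = seq.upper()
    n = len(seq)
    g, c = seq.count("G"), seq.count("C")
    cpg = seq.count("CG")
    gc_content = (g + c) / n
    # Observed/expected CpG ratio; guard against division by zero.
    obs_exp = (cpg * n) / (g * c) if g and c else 0.0
    return {"gc_content": gc_content, "cpg_obs_exp": obs_exp}

# A CpG-rich toy sequence scores high on both features
# (the common island thresholds are GC > 0.5 and obs/exp > 0.6).
features = cpg_features("CGCGCGCGATCGCGGCGC")
print(features)
```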

  17. f

    Data_Sheet_1_Validation of a Bioinformatics Workflow for Routine Analysis of...

    • datasetcatalog.nlm.nih.gov
    Updated Mar 6, 2019
    Cite
    Roosens, Nancy H. C.; Mattheus, Wesley; Fu, Qiang; Ceyssens, Pieter-Jan; Vanneste, Kevin; De Keersmaecker, Sigrid C. J.; Van Braekel, Julien; Bertrand, Sophie; Bogaerts, Bert; Winand, Raf (2019). Data_Sheet_1_Validation of a Bioinformatics Workflow for Routine Analysis of Whole-Genome Sequencing Data and Related Challenges for Pathogen Typing in a European National Reference Center: Neisseria meningitidis as a Proof-of-Concept.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000172206
    Explore at:
    Dataset updated
    Mar 6, 2019
    Authors
    Roosens, Nancy H. C.; Mattheus, Wesley; Fu, Qiang; Ceyssens, Pieter-Jan; Vanneste, Kevin; De Keersmaecker, Sigrid C. J.; Van Braekel, Julien; Bertrand, Sophie; Bogaerts, Bert; Winand, Raf
    Description

    Despite being a well-established research method, the use of whole-genome sequencing (WGS) for routine molecular typing and pathogen characterization remains a substantial challenge due to the required bioinformatics resources and/or expertise. Moreover, many national reference laboratories and centers, as well as other laboratories working under a quality system, require extensive validation to demonstrate that employed methods are “fit-for-purpose” and provide high-quality results. However, a harmonized framework with guidelines for the validation of WGS workflows does not yet exist, despite several recent case studies highlighting the urgent need for one. We present a validation strategy focusing specifically on the exhaustive characterization of the bioinformatics analysis of a WGS workflow designed to replace conventionally employed molecular typing methods for microbial isolates in a representative small-scale laboratory, using the pathogen Neisseria meningitidis as a proof-of-concept. We adapted several classically employed performance metrics specifically toward three different bioinformatics assays: resistance gene characterization (based on the ARG-ANNOT, ResFinder, CARD, and NDARO databases), several commonly employed typing schemas (including, among others, core genome multilocus sequence typing), and serogroup determination. We analyzed a core validation dataset of 67 well-characterized samples typed by means of classical genotypic and/or phenotypic methods that were sequenced in-house, allowing us to evaluate repeatability, reproducibility, accuracy, precision, sensitivity, and specificity of the different bioinformatics assays. We also analyzed an extended validation dataset composed of publicly available WGS data for 64 samples by comparing results of the different bioinformatics assays against results obtained from commonly used bioinformatics tools.
We demonstrate high performance, with values for all performance metrics >87%, >97%, and >90% for the resistance gene characterization, sequence typing, and serogroup determination assays, respectively, for both validation datasets. Our WGS workflow has been made publicly available as a “push-button” pipeline for Illumina data at https://galaxy.sciensano.be to showcase its implementation for non-profit and/or academic usage. Our validation strategy can be adapted to other WGS workflows for other pathogens of interest and demonstrates the added value and feasibility of employing WGS with the aim of being integrated into routine use in an applied public health setting.
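The validation metrics named above can be computed from a confusion matrix; a minimal sketch with hypothetical counts (these are not the paper's actual tallies):

```python
def assay_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, and specificity, as commonly used when
    validating a typing assay against reference results."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Illustrative counts for a 67-sample validation set.
m = assay_metrics(tp=30, tn=33, fp=2, fn=2)
print(m)  # all three metrics above 0.9
```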

  18. A microarray meta-dataset of liver cancer

    • ebi.ac.uk
    Updated Apr 10, 2019
    Cite
    Su Bin Lim (2019). A microarray meta-dataset of liver cancer [Dataset]. https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6695
    Explore at:
    Dataset updated
    Apr 10, 2019
    Authors
    Su Bin Lim
    Description

    We present a meta-dataset comprising a total of 401 samples, including both primary tumors and tumor-free liver tissues, from seven independent GEO datasets. To minimise inter-platform variation, only datasets generated on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open-source R packages implemented in our previously developed bioinformatics pipeline, each dataset was preprocessed with RMA normalisation, merged, and batch-effect-corrected via the ComBat method. With its increased sample size, the present meta-dataset serves as an excellent 'discovery cohort' for identifying genes differentially expressed in the diseased phenotype.
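As a simplified illustration of additive batch-effect correction: the sketch below only recentres each batch on the global mean (ComBat additionally shrinks batch estimates via empirical Bayes), and the expression values are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two toy batches of expression values for one gene; batch 2 carries
# an additive technical offset of +2.
batch1 = rng.normal(5.0, 1.0, 100)
batch2 = rng.normal(5.0, 1.0, 100) + 2.0
values = np.concatenate([batch1, batch2])
batches = np.array([0] * 100 + [1] * 100)

# Simplified additive correction: recentre each batch on the global mean.
corrected = values.copy()
for b in np.unique(batches):
    mask = batches == b
    corrected[mask] += values.mean() - values[mask].mean()

# The between-batch mean difference shrinks to ~0 after correction.
print(abs(corrected[batches == 0].mean() - corrected[batches == 1].mean()))
```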

  19. Research data for "Subjective data models in bioinformatics: Do wet-lab and...

    • figshare.manchester.ac.uk
    • explore.openaire.eu
    txt
    Updated Jun 1, 2023
    Cite
    Yochannah Yehudi; Carole Goble; Caroline Jay; Lukas Hughes-Noehrer (2023). Research data for "Subjective data models in bioinformatics: Do wet-lab and computational biologists comprehend data differently?" [Dataset]. http://doi.org/10.48420/20641017.v2
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    University of Manchester
    Authors
    Yochannah Yehudi; Carole Goble; Caroline Jay; Lukas Hughes-Noehrer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Subjective data models dataset

    This dataset is comprised of data collected from study participants, for a study into how people working with biological data perceive data, and whether or not this perception of data aligns with a person's experiential and educational background. We call the concept of what data looks like to an individual a "subjective data model".

    Todo: link paper/preprint once published.

    Computational python analysis code: https://doi.org/10.5281/zenodo.7022789 and https://github.com/yochannah/subjective-data-models-analysis

    Files

    Transcripts of the recorded sessions are attached and have been verified by a second researcher. These files are all in plain text .txt format. Note that participant 3 did not agree to sharing the transcript of their interview.
    • Interview paper files: This folder has digital and photographed versions of the files shown to the participants for the file mapping task. Note that the original files are from the NCBI and from FlyBase. Videos and stills from the recordings have been deleted in line with the Data Management Plan and Ethical Review.
    • anonymous_participant_list.csv: Shows which files have transcripts associated (not all participants agreed to share transcripts), the order of Tasks A and B, the date of interview, and what entities participants added to the set provided (if any). See the paper methods for more info about why entities were added to the set.
    • cards.txt: A full list of the cards presented in the tasks.
    • background survey and background manual annotations: The selected survey data about participant background, with manual additions where necessary, e.g. to interpret free text.
    • codes.csv: The qualitative codes used within the transcripts.
    • entry_point.csv: A record of participants' identified entry points into the data.
    • file_mapping_responses: A record of responses to the file mapping task.

  20. Data from: Discovery of facultative parthenogenesis in a New World crocodile...

    • datadryad.org
    • nde-dev.biothings.io
    • +4more
    zip
    Updated Mar 17, 2023
    Cite
    Brenna Levine; Warren Booth (2023). Discovery of facultative parthenogenesis in a New World crocodile [Dataset]. http://doi.org/10.5061/dryad.7sqv9s4x1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 17, 2023
    Dataset provided by
    Dryad
    Authors
    Brenna Levine; Warren Booth
    Time period covered
    Mar 15, 2023
    Description

    DNA extracted from the mother and fetus using a Qiagen DNeasy Blood & Tissue kit was sent to Novogene (Sacramento, CA) for whole genome sequencing on an Illumina platform (NovaSeq 6000 PE150). Raw sequences were mapped by Novogene to the Saltwater crocodile, Crocodylus porosus, reference genome, with single nucleotide polymorphisms (SNPs) identified by Novogene using the following command in SAMtools: mpileup -m 2 -F 0.002 -d 10. Following the parameters used in Card et al. as a guideline, variants were filtered using VCFtools v. 0.1.16, the R package vcfR, and bedtools v2.30.0, with the following criteria: (1) indels were excluded; (2) individuals with a read depth of less than 5 were excluded; (3) variants with a Phred quality score below 30 were excluded; (4) non-biallelic SNPs were excluded; (5) SNPs with significant statistical biases were removed using the hard filter ‘MQ < 40.0’; (6) SNPs were thinned to avoid the potential effects of linkage by randomly selecting one v...
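The listed criteria can be sketched as a small Python filter over toy VCF-like records. The records and the helper are illustrative only; the authors used VCFtools, vcfR, and bedtools, not this code:

```python
# Toy VCF-like records: (CHROM, POS, REF, ALT, QUAL, INFO).
records = [
    ("chr1", 100, "A", "T",    50, {"DP": 12, "MQ": 55.0}),
    ("chr1", 200, "A", "ATG",  60, {"DP": 20, "MQ": 50.0}),  # indel
    ("chr1", 300, "G", "C",    20, {"DP": 15, "MQ": 45.0}),  # low QUAL
    ("chr1", 400, "C", "T,G",  80, {"DP": 30, "MQ": 60.0}),  # multiallelic
    ("chr1", 500, "T", "C",    90, {"DP": 3,  "MQ": 58.0}),  # low depth
    ("chr1", 600, "T", "A",    95, {"DP": 25, "MQ": 35.0}),  # low MQ
]

def keep(rec):
    chrom, pos, ref, alt, qual, info = rec
    if len(ref) != 1 or any(len(a) != 1 for a in alt.split(",")):
        return False                      # (1) exclude indels
    if info["DP"] < 5:
        return False                      # (2) read depth >= 5
    if qual < 30:
        return False                      # (3) Phred quality >= 30
    if "," in alt:
        return False                      # (4) biallelic SNPs only
    if info["MQ"] < 40.0:
        return False                      # (5) hard filter MQ < 40.0
    return True

passing = [r for r in records if keep(r)]
print(len(passing))  # 1
```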

Bioinformatics Protein Dataset - Simulated


Proposed Uses

This dataset is ideal for:
  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
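The hydrophobicity property can be reproduced in spirit with the Kyte-Doolittle scale that inspired the dataset; a dependency-free sketch (the dataset itself used Biopython, and the helper name here is ours):

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydrophobicity(seq: str) -> float:
    """Average (GRAVY) hydrophobicity over a protein sequence."""
    return sum(KD[aa] for aa in seq.upper()) / len(seq)

# A hydrophobic stretch scores positive; charged residues score negative.
print(round(mean_hydrophobicity("ILVF"), 3))  # 3.825
```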

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
