92 datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
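
    The three steps above can be sketched in a few lines. This is only an illustration, not the author's script: the description says Biopython was used, whereas this sketch uses the standard library and a hard-coded Kyte-Doolittle table so that it runs anywhere. The ID format, the set of residues counted as polar, and the function names are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy value for each of the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: the dataset does not say which residues it counts as polar.
POLAR = set("STNQYCHKRDE")


def average_hydrophobicity(sequence):
    """Mean Kyte-Doolittle value over all residues of the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_protein(rng, index):
    """Build one synthetic record with a subset of the listed columns."""
    length = rng.randint(50, 300)  # step 1: random length, 50 to 300 residues
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    polar = 100 * sum(aa in POLAR for aa in sequence) / length
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": average_hydrophobicity(sequence),  # step 2
        "Polar_Proportion": polar,
        "Nonpolar_Proportion": 100 - polar,
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),  # step 3: label drawn at random
    }


rng = random.Random(0)
proteins = [make_protein(rng, i) for i in range(5)]
```

    Because the class label is drawn independently of the sequence, a classifier trained on data built this way should not be expected to beat chance.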

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
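
    The 16,000 / 4,000 split corresponds to an 80/20 ratio over the 20,000 records. A minimal sketch of such a split follows; how the author actually partitioned the data is not stated, so the shuffling, the seed, and the function name are assumptions.

```python
import random


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle the records and return (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 20,000 protein rows.
train, test = split_records(range(20000))
```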

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Functional Annotation of the Human Chromosome 7 “Missing” Proteins: A...

    • figshare.com
    • acs.figshare.com
    xls
    Updated Jun 7, 2023
    Cite
    Shoba Ranganathan; Javed M. Khan; Gagan Garg; Mark S. Baker (2023). Functional Annotation of the Human Chromosome 7 “Missing” Proteins: A Bioinformatics Approach [Dataset]. http://doi.org/10.1021/pr301082p.s004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    ACS Publications
    Authors
    Shoba Ranganathan; Javed M. Khan; Gagan Garg; Mark S. Baker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    The chromosome-centric human proteome project aims to systematically map all human proteins, chromosome by chromosome, in a gene-centric manner through dedicated efforts from national and international teams. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyze the massive proteomic data sets currently being generated. The neXtProt database lists 946 proteins as the human proteome of chromosome 7. However, 170 (18%) proteins of human chromosome 7 have no evidence at the proteomic, antibody, or structural levels and are considered “missing” in this study as they lack experimental support. We have developed a protocol for the functional annotation of these “missing” proteins by integrating several bioinformatics analysis and annotation tools, sequential BLAST homology searches, protein domain/motif and gene ontology (GO) mapping, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Using the BLAST search strategy, homologues for reviewed non-human mammalian proteins with protein evidence were identified for 90 “missing” proteins while another 38 had reviewed non-human mammalian homologues. Putative functional annotations were assigned to 27 of the remaining 43 novel proteins. Proteotypic peptides have been computationally generated to facilitate rapid identification of these proteins. Four of the “missing” chromosome 7 proteins have been substantiated by the ENCODE proteogenomic peptide data.

  3. Data from: Alignment-based protein mutational landscape prediction: doing...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Feb 2, 2024
    Cite
    Marina Abakarova; Céline Marquet; Michael Rera; Burkhard Rost; Elodie Laine (2024). Alignment-based protein mutational landscape prediction: doing more with less [Dataset]. http://doi.org/10.5061/dryad.vdncjsz1s
    Explore at:
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marina Abakarova; Céline Marquet; Michael Rera; Burkhard Rost; Elodie Laine
    Time period covered
    Jan 1, 2023
    Description

    The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.

    Alignment-based protein mutational landscape prediction: doing more with less

    Access this dataset on Dryad

    This dataset contains the data and tools associated with Alignment-based protein mutational landscape prediction: doing more with less, Abakarova et al., Genome Biology and Evolution, 2023. doi:

    Description of the data and file structure

    We provide the community with data associated with our assessment of four different multiple sequence alignment (MSA) resources and protocols, as well as the complete single-mutational landscape of the human proteome predicted by combining the MSA protocol implemented in ColabFold and the variant effect predictor GEMME.

    1. ProteinGym_assessment.tgz contains the data and scripts associated with our assessment of the four different MSA generation protocols (ColabFold, ProteinGym, ProteinNet, Pfam) against the ProteinGym substitution benchmark. This archive is organised as follo...
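
    For orientation, a "complete single-mutational landscape" covers every single amino-acid substitution at every position of a protein, that is, 19 variants per residue. The sketch below only enumerates those variants; it is not GEMME or the authors' pipeline, and the function name and the wild-type/position/mutant labelling (for example M1A) are choices made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """Yield every single substitution as wild type + 1-based position + mutant."""
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


# A three-residue toy sequence gives 3 * 19 variants.
variants = list(single_substitutions("MKT"))
```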
  4. Bioinformatics Links Directory

    • neuinfo.org
    • dknet.org
    • +2 more
    Updated Oct 13, 2024
    Cite
    (2024). Bioinformatics Links Directory [Dataset]. http://identifiers.org/RRID:SCR_008018
    Explore at:
    Dataset updated
    Oct 13, 2024
    Description

    Database of curated links to molecular resources, tools and databases selected on the basis of recommendations from bioinformatics experts in the field. This resource relies on input from its community of bioinformatics users for suggestions. Since 2003, it has also listed all links contained in the NAR Webserver issue. The portal offers the following types of information:

    • Computer Related: This category contains links to resources relating to programming languages often used in bioinformatics. Other tools of the trade, such as web development and database resources, are also included here.
    • Sequence Comparison: Tools and resources for the comparison of sequences including sequence similarity searching, alignment tools, and general comparative genomics resources.
    • DNA: This category contains links to useful resources for DNA sequence analyses such as tools for comparative sequence analysis and sequence assembly. Links to programs for sequence manipulation, primer design, and sequence retrieval and submission are also listed here.
    • Education: Links to information about the techniques, materials, people, places, and events of the greater bioinformatics community. Included are current news headlines, literature sources, educational material and links to bioinformatics courses and workshops.
    • Expression: Links to tools for predicting the expression, alternative splicing, and regulation of a gene sequence are found here. This section also contains links to databases, methods, and analysis tools for protein expression, SAGE, EST, and microarray data.
    • Human Genome: This section contains links to draft annotations of the human genome in addition to resources for sequence polymorphisms and genomics. Also included are links related to ethical discussions surrounding the study of the human genome.
    • Literature: Links to resources related to published literature, including tools to search for articles and through literature abstracts. Additional text mining resources, open access resources, and literature goldmines are also listed.
    • Model Organisms: Included in this category are links to resources for various model organisms ranging from mammals to microbes. These include databases and tools for genome scale analyses.
    • Other Molecules: Bioinformatics tools related to molecules other than DNA, RNA, and protein. This category will include resources for the bioinformatics of small molecules as well as for other biopolymers including carbohydrates and metabolites.
    • Protein: This category contains links to useful resources for protein sequence and structure analyses. Resources for phylogenetic analyses, prediction of protein features, and analyses of interactions are also found here.
    • RNA: Resources include links to sequence retrieval programs, structure prediction and visualization tools, motif search programs, and information on various functional RNAs.

  5. Data from: PROSITE

    • prosite.expasy.org
    • the-mouth.com
    • +7 more
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE [Dataset]. https://prosite.expasy.org/
    Explore at:
    Dataset updated
    Feb 5, 2025
    Description

    PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
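
    PROSITE patterns use a compact syntax: elements are separated by '-', 'x' stands for any residue, '[...]' lists allowed residues, '{...}' lists excluded residues, '(n)' or '(n,m)' gives a repeat count, and '<' and '>' anchor the pattern to the sequence ends. The sketch below translates that core syntax into a Python regular expression. The function name is hypothetical, and rarer cases such as '>' inside square brackets are deliberately not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate the core PROSITE pattern syntax into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Excluded residues: {P} becomes [^P]. Done before repeats so the
        # braces introduced for repeat counts are not rewritten.
        element = element.replace("{", "[^").replace("}", "]")
        # Repeat counts: x(2,4) becomes x{2,4}.
        element = element.replace("(", "{").replace(")", "}")
        # Any residue: x becomes a dot (residues are upper-case letters).
        element = element.replace("x", ".")
        # Sequence anchors.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# The N-glycosylation site pattern, applied to a toy sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "ANGSA")
```

    Note that `re.search` reports only the first, non-overlapping hit, so a real scanner would need extra care to find overlapping sites.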

  6. Data from: Hydrophobic-hydrophilic forces and their effects on protein...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Trent Higgs; Bela Stantic; Tamjidul Hoque; Abdul Sattar (2022). Hydrophobic-hydrophilic forces and their effects on protein structural similarity [Dataset]. http://doi.org/10.4225/03/5a13709f243b5
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Trent Higgs; Bela Stantic; Tamjidul Hoque; Abdul Sattar
    Description

    Hydrophobic-hydrophilic interactions have a strong impact on the three-dimensional structure a protein will adopt. Because structure, not amino acid sequence order, carries out certain functions, it is important to understand how these forces affect the protein folding process. In recent years, a lot of focus has been dedicated to ab initio protein folding prediction, which tries to predict a protein's native conformation from its sequence alone. To aid this type of prediction, sub-conformations from already known proteins are used to limit the free energy conformational search space. In this paper we looked into the sub-conformations’ hydrophobic-hydrophilic nature by incorporating an HP approach and proposed a way of evaluating how these types of forces affect the protein folding process. By doing this, we can gain insight into how hydrophobic-hydrophilic interactions affect protein structural similarity, which helps in picking more suitable sub-conformations based on their HP shape for use in protein structure prediction. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
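
    In an HP approach, each residue is reduced to one of two letters, H (hydrophobic) or P (polar/hydrophilic). A minimal sketch of that reduction follows; the particular set of residues treated as hydrophobic differs between studies, so the set used here is an assumption and not necessarily the one used in the paper.

```python
# Assumption: one common choice of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")


def to_hp(sequence):
    """Reduce an amino acid sequence to its H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())


hp_string = to_hp("MKTLV")
```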

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  7. iPTMnet

    • scicrunch.org
    • rrid.site
    Updated Mar 12, 2025
    Cite
    (2025). iPTMnet [Dataset]. http://identifiers.org/RRID:SCR_014416
    Explore at:
    Dataset updated
    Mar 12, 2025
    Description

    A protein database which connects multiple disparate bioinformatics tools and systems: text mining, data mining, analysis and visualization tools, and databases and ontologies.

  8. FTDMP docking results for protein-protein, protein-DNA, protein-RNA...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Banciul, Rita (2024). FTDMP docking results for protein-protein, protein-DNA, protein-RNA benchmarks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12804207
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset authored and provided by
    Banciul, Rita
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    FTDMP docking results for protein-protein, protein-DNA, protein-RNA benchmarks.

    FTDMP is a software system for running docking experiments and scoring/ranking multimeric models. This dataset contains FTDMP docking results for protein-protein, protein-DNA, protein-RNA benchmarks. The FTDMP framework itself is available at https://github.com/kliment-olechnovic/ftdmp.

    Every *.tar.gz file in this dataset contains two folders: results for unbound-unbound and bound-bound docking. These folders contain results for the benchmark cases:

    • 252 folders with results for the protein-protein docking benchmark cases [1].
    • 47 folders with results for the protein-DNA docking benchmark cases [2].
    • 42 folders with results for the protein-RNA docking benchmark cases [3-6].

    Every folder is named according to the PDB ID of the complex. The folders contain:

    1. A subfolder named relaxed_top_complexes. This subfolder contains 200 pdb files of relaxed [7] top docking models.
    2. A text file named scoring_results-ranks.txt. It contains the names of the models (that are in the relaxed_top_complexes folder) in the ranked order. This means that the first model in the file is considered to be the best prediction by the FTDMP framework.
    3. A text file named cad_scores.txt. It contains interface CAD-score and binding site CAD-score [8] results for every model.
    4. A text file named rmsd_results.txt, which is available only for protein-DNA and protein-RNA cases. The file contains ligand-RMSD values for the models, where the DNA/RNA is considered as the ligand.
    5. A text file named DockQ_results.txt, which is available only for the protein-protein docking cases. The file contains DockQ [9] results for every model, as well as model accuracy based on CAPRI criteria (Incorrect, Acceptable, Medium, High).
    6. A text file named binding_site_CAD-scores.txt, which contains the binding site CAD-score from the protein side for RNA and DNA docking. This binding site CAD-score shows how accurately the ligand (DNA/RNA) was docked to the protein without taking the orientation of the ligand into consideration. In the case of protein-protein docking the binding site CAD-score file is available only for antibody-antigen docking targets and contains the binding site (epitope) CAD-score for the antigen.
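
    As a reading aid for the DockQ results, the sketch below maps a DockQ score to the four quality classes using the cut-offs commonly quoted for DockQ (0.23, 0.49 and 0.80). Whether DockQ_results.txt derives its accuracy labels from these exact cut-offs or from the underlying CAPRI measures is not stated, so treat the thresholds and the function name as assumptions.

```python
def quality_class_from_dockq(dockq):
    """Map a DockQ score in [0, 1] to a CAPRI-style quality class."""
    if not 0.0 <= dockq <= 1.0:
        raise ValueError("DockQ scores lie between 0 and 1")
    if dockq >= 0.80:
        return "High"
    if dockq >= 0.49:
        return "Medium"
    if dockq >= 0.23:
        return "Acceptable"
    return "Incorrect"


# Hypothetical scores, not taken from the dataset.
labels = [quality_class_from_dockq(score) for score in (0.10, 0.30, 0.60, 0.90)]
```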

    The ligand-RMSD, CAD-scores, and DockQ scores were all calculated by comparing the models to the corresponding targets. The target structures are available at https://zenodo.org/records/10517524. These target structures have the same residue numbering as the models available here.

    REFERENCES

    [1] Guest, J. D., et al. (2021). An expanded benchmark for antibody-antigen docking and affinity prediction reveals insights into antibody recognition determinants. Structure, 29(6), 606–621.e5.
    [2] van Dijk, M., Bonvin, A.M. (2008). A protein-DNA docking benchmark. Nucleic Acids Res, 36, e88.
    [3] Perez-Cano, L., et al. (2012). A protein-RNA docking benchmark (II): extended set from experimental and homology modeling data. Proteins, 80(7): 1872-1882.
    [4] Huang, S.Y., Zou, X. (2013). A nonredundant structure dataset for benchmarking protein-RNA computational docking. J Comput Chem, 34(4): 311-318.
    [5] Nithin, C., et al. (2017). A non-redundant protein-RNA docking benchmark version 2.0. Proteins, 85(2): 256-267.
    [6] Zheng, J., et al. (2020). P3DOCK: a protein-RNA docking webserver based on template-based and template-free docking. Bioinformatics, 36(1), 96–103.
    [7] Eastman, P., et al. (2017). OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLOS Comp. Biol., 13(7): e1005659.
    [8] Olechnovic, K., Venclovas, C. (2020). Contact area-based structural analysis of proteins and their complexes using CAD-score. Methods Mol Biol, 2112, 75.
    [9] Basu, S., Wallner, B. (2016). DockQ: A Quality Measure for Protein-Protein Docking Models. PLoS ONE 11(8): e0161879.

  9. Bioinformatics Market Analysis North America, Europe, Asia, Rest of World...

    • technavio.com
    Updated Feb 23, 2022
    Cite
    Technavio (2022). Bioinformatics Market Analysis North America, Europe, Asia, Rest of World (ROW) - US, Germany, UK, Canada, France - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/bioinformatics-market-industry-analysis
    Explore at:
    Dataset updated
    Feb 23, 2022
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global
    Description


    Bioinformatics Market Size 2024-2028

    The bioinformatics market size is forecast to increase by USD 13.2 billion at a CAGR of 16.59% between 2023 and 2028. The market is experiencing significant growth due to the reduction in the cost of genetic sequencing and the development of sophisticated bioinformatics tools for next-generation sequencing (NGS). These advancements are enabling the identification and analysis of disease biomarkers, leading to the discovery of new therapeutic strategies. The market is also driven by the increasing demand for database development and management systems to store and analyze the vast amounts of data generated from NGS. Furthermore, the potential of gene therapy and drug development in treating various diseases is fueling the market growth. However, the shortage of trained laboratory professionals poses a challenge to the market, as the analysis of complex genomic data requires specialized expertise.
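
    The two headline figures can be related to each other with the compound-growth formula. The sketch below backs out the base-year market size that would be consistent with a USD 13.2 billion increase at a 16.59% CAGR, assuming five compounding years (2023 to 2028). The report excerpt does not state the base value, so the result is only what those two numbers imply under that assumption.

```python
def implied_base(increment, cagr, years):
    """Base value B such that B * (1 + cagr) ** years - B equals increment."""
    growth = (1 + cagr) ** years
    return increment / (growth - 1)


# Assumption: 5 compounding years between 2023 and 2028.
base = implied_base(increment=13.2, cagr=0.1659, years=5)
final = base * (1 + 0.1659) ** 5
```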

    What will be the Size of the Bioinformatics Market During the Forecast Period?


    Bioinformatics is a rapidly growing market, driven by advancements in genome sequencing and NGS technologies. Precision medicine, which utilizes genomic information for personalized healthcare, is a key application area. The market is witnessing a significant decrease in equipment costs, making genomics instruments more accessible to researchers and healthcare providers. Transcriptomics, which focuses on the study of RNA, is another emerging field. Virus research is a significant application area, with a focus on transmission chains, public health control, and containment measures. Virus variability and vaccine development are major challenges, driving the need for advanced diagnostic methods. Key players in the market include Illumina and Eurofins Scientific.

    Moreover, companies are making strides in addressing this challenge by providing comprehensive solutions for bioinformatics analysis and data management. Big data is another key trend in the market, with the use of advanced algorithms and machine learning techniques to extract valuable insights from genomic data. Overall, the market is poised for strong growth, driven by technological advancements, increasing demand for personalized medicine, and the potential to revolutionize disease diagnosis and treatment. In addition, these companies provide a range of services, from DNA and RNA sequencing to bioinformatics analysis and diagnostic testing. The market is expected to grow significantly due to the increasing demand for accurate and timely diagnostic methods and the ongoing research in the field of genomics and transcriptomics.

    The bioinformatics market is expanding rapidly, driven by advancements in genomics data analysis, next-gen sequencing, and precision medicine. Cloud-based bioinformatics solutions and AI in bioinformatics are revolutionizing molecular diagnostics, drug discovery platforms, and protein analysis tools. The market emphasizes genomic data storage, personalized healthcare, and biomarker discovery. With bioinformatics software, computational biology, and integrative bioinformatics solutions, bioinformatics as a service plays a pivotal role in advancing modern healthcare.

    Bioinformatics Market Segmentation

    The bioinformatics market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Application
      Molecular phylogenetics
      Transcriptomic
      Proteomics
      Metabolomics

    Product
      Platforms
      Tools
      Services

    Geography
      North America
        Canada
        US
      Europe
        Germany
        UK
        France
      Asia
      Rest of World (ROW)

    By Application Insights

    The molecular phylogenetics segment is estimated to witness significant growth during the forecast period. Bioinformatics, a critical field in molecular biology, encompasses the application of computational tools and techniques to analyze biological data. One significant area within bioinformatics is molecular phylogenetics, which utilizes molecular data to explore evolutionary relationships among various species. This technique has transformed the biological landscape by offering more precise and comprehensive insights into the interconnections among living organisms. In the international market, molecular phylogenetics is a vital instrument in numerous research domains, such as clinical diagnostics, drug discovery, RNA-based therapeutics, and conservation biology. For instance, in the realm of viral research, molecular phylogenetics is extensively employed to examine the evolution of viruses.

    In addition, by deciphering the molecular data of distinct strains of viruses, scientists can trace the origins and dissemination patterns of these pathoge

  10. Data from: Classifying protein kinase conformations with machine learning

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2023
    Cite
    Ivan REVEGUK (2023). Classifying protein kinase conformations with machine learning [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_8172570
    Explore at:
    Dataset updated
    Jul 23, 2023
    Dataset authored and provided by
    Ivan REVEGUK
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This data collection accompanies the manuscript "Classifying protein kinase conformations with machine learning".

    It was created using the kinactive v0.1 tool written in pure Python>=3.10. Note that the data are provided for reference and reproducibility purposes and will not be compatible with later versions of kinactive built upon lXtractor > 0.1.1. Refer to the kinactive documentation for instructions on how to obtain an up-to-date version of the structural kinome collection.

    File descriptions:

    db_v3.tar.gz -- a structural kinome collection archive. One can unpack it and inspect the contents or load it into the Python interpreter using kinactive or lXtractor tools.

    default_*_vs.tsv -- structure/sequence variables calculated with lXtractor and used in an interpretable ML pipeline.

    *_features.tsv -- lists of ranked features selected by the eBoruta tool for each classifier.

    Supplement_labels.tsv -- ML model predictions for each PK domain structure found in db_v3.
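
    To inspect an archive such as db_v3.tar.gz without unpacking it fully, the standard-library tarfile module is enough. The sketch below lists the first few member names; the function name and the limit are choices made for the example, and nothing is assumed about the archive's internal layout.

```python
import tarfile


def list_archive(path, limit=20):
    """Return up to `limit` member names from a gzip-compressed tar archive."""
    names = []
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

    With the archive downloaded, a call such as list_archive("db_v3.tar.gz") would show its top entries before committing to a full extraction.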

  11. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
    Explore at:
    Available download formats: application/gzip, bz2, zip
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundaries predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.

    For both TED100 and TEDredundant we provide domain boundaries predictions outputted by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available 7,427 novel folds PDB files, identified during the TED classification process with an annotation table sorted by novelty.

    Please use the gunzip command to extract files with a '.gz' extension.

    CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
    Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • ted_324m_seq_clustering.cathlabels.tsv
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available i.e. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id - TED domain identifier in the format AF-
    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The files contain a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundaries predictions are in the same format with the following columns.
      1. TED_chainID - TED chain identifier in the format AF-
    • Domain boundaries predictions share the same format, with each segment separated by '_' and segment boundaries (start,stop) separated by '-'

      e.g. domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
    • cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
    • ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
    • gofocus_data.tar.bz2 - GOFocus model weights
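The domain-boundary string format described above can be parsed with a few lines of Python. This is a minimal sketch, not part of the TED tooling: the function name `parse_chopping` is hypothetical, and the use of ',' as the separator between domains is inferred from the Merizo example rather than stated in the description.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, stop) residue numbers


def parse_chopping(chopping: str) -> List[List[Segment]]:
    """Parse a boundary string such as '10-52_289-394,53-288'.

    Assumed separators: ',' between domains (inferred from the example),
    '_' between segments, '-' between the start and stop of a segment.
    """
    domains: List[List[Segment]] = []
    for domain in chopping.split(","):
        segments: List[Segment] = []
        for segment in domain.split("_"):
            start, stop = segment.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


domains = parse_chopping("10-52_289-394,53-288")
for index, segments in enumerate(domains, start=1):
    kind = "discontinuous" if len(segments) > 1 else "continuous"
    print(f"Domain {index} ({kind}): {segments}")
# Domain 1 (discontinuous): [(10, 52), (289, 394)]
# Domain 2 (continuous): [(53, 288)]
```

A domain with more than one segment is reported as discontinuous, which matches the Merizo example for AF-A0A000-F1-model_v4.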
  12. wor_turnover

    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    michele tinti (2023). wor_turnover [Dataset]. http://doi.org/10.6084/m9.figshare.9919235.v1
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    michele tinti
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    To establish baseline data on T. brucei proteome turnover, a stable isotope labelling with amino acids in cell culture (SILAC)-based mass spectrometry analysis was performed to reveal the synthesis and degradation profiles for thousands of proteins in the bloodstream and procyclic forms of this parasite.

  13. European Nucleotide Archive (ENA)

    • rrid.site
    • scicrunch.org
    • +2more
    Updated Feb 9, 2025
    Cite
    (2025). European Nucleotide Archive (ENA) [Dataset]. http://identifiers.org/RRID:SCR_006515
    Dataset updated
    Feb 9, 2025
    Description

    Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement. The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers and routine and comprehensive exchange with their partners in the International Nucleotide Sequence Database Collaboration (INSDC). Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use.
All ENA data are available through the ENA Browser. Note: EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.

  14. Functional categories of Coding Sequences in the genome of...

    • figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Nupoor Chowdhary; Ashok Selvaraj; Lakshmi KrishnaKumaar; Gopal Ramesh Kumar (2023). Functional categories of Coding Sequences in the genome of Caldicellulosiruptor saccharolyticus before and after re-annotation. [Dataset]. http://doi.org/10.1371/journal.pone.0133183.t001
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nupoor Chowdhary; Ashok Selvaraj; Lakshmi KrishnaKumaar; Gopal Ramesh Kumar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Functional categories of Coding Sequences in the genome of Caldicellulosiruptor saccharolyticus before and after re-annotation.
    * Number of protein-encoding genes in each category without pseudogenes.
    # (X/Y): X is the value belonging to the old CDSs; Y is the value belonging to the new CDSs.

  15. Data from: Discrimination of GO term annotated proteins based on amino acid...

    • bridges.monash.edu
    pdf
    Updated Nov 21, 2017
    Cite
    Taguchi, Y. H.; Gromiha, M. Michael (2017). Discrimination of GO term annotated proteins based on amino acid occurrence and composition [Dataset]. http://doi.org/10.4225/03/5a137191a1e39
    Dataset updated
    Nov 21, 2017
    Dataset provided by
    Monash University
    Authors
    Taguchi, Y. H.; Gromiha, M. Michael
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    In this paper, we have applied linear discriminant analysis and support vector machines to predict GO term annotated proteins using amino acid occurrence/composition in the UniRef50 data set, i.e., UniProt sequences with less than 50% sequence identity. We found that our method could discriminate between proteins with at least one known GO term and those without any annotation at an AUC of 0.82 using a three-fold cross-validation test. Discrimination of the 38 most frequent GO terms is achieved with a maximum AUC of 0.91. Our method is based solely on amino acid sequence and hence will be useful for predicting GO term associations of newly obtained amino acid sequences without any annotated known homolog. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
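As an illustration of the kind of sequence-only feature the abstract refers to, the sketch below computes an amino acid composition vector, i.e. the fraction of each of the 20 standard residues in a sequence. It is an assumption about what "composition" means here, not the authors' code; the function name `aa_composition` and the choice to ignore non-standard letters are hypothetical.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard letters (e.g. X, B, Z) are ignored; this is an
    illustrative choice, not something the abstract specifies.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


features = aa_composition("MKTAYIAKQR")
print(len(features), round(sum(features), 6))
```

A vector like this has a fixed length of 20 regardless of sequence length, which is what makes it usable as input to a classifier such as LDA or an SVM.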

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  16. Functional categories of newly predicted Coding Sequences in the genome of...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Nupoor Chowdhary; Ashok Selvaraj; Lakshmi KrishnaKumaar; Gopal Ramesh Kumar (2023). Functional categories of newly predicted Coding Sequences in the genome of Caldicellulosiruptor saccharolyticus. [Dataset]. http://doi.org/10.1371/journal.pone.0133183.t003
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nupoor Chowdhary; Ashok Selvaraj; Lakshmi KrishnaKumaar; Gopal Ramesh Kumar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Functional categories of newly predicted Coding Sequences in the genome of Caldicellulosiruptor saccharolyticus.

  17. Molecular Biology Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 17, 2025
    Cite
    AMA Research & Media LLP (2025). Molecular Biology Software Report [Dataset]. https://www.archivemarketresearch.com/reports/molecular-biology-software-32340
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    AMA Research & Media LLP
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global molecular biology software market is projected to reach USD 4.2 billion by 2033, exhibiting a CAGR of 10.5% during the forecast period (2023-2033). The market growth is primarily driven by the increasing demand for efficient and accurate tools for molecular biology research. The advancements in genomics and personalized medicine, coupled with the growing need for bioinformatics analysis, are further fueling the market expansion. The market is highly competitive, with established players such as QIAGEN, DNASTAR, Inc., SCIEX, SoftGenetics, LLC., and Thermo Fisher Scientific holding significant market shares. These companies offer a wide range of software solutions for plasmid mapping, DNA/protein database search and analysis, primer design, and other bioinformatics applications. The market also includes emerging players such as Benchling, CapitalBio Technology, and Biomatters, who are gaining traction with innovative software solutions and collaborations with research institutions. The market is segmented based on type, application, and region, with the research segment accounting for the largest market share due to the extensive use of molecular biology software in academic and medical research.

  18. Data from: Algorithms for detecting protein complexes in PPI networks: an...

    • figshare.com
    • bridges.monash.edu
    pdf
    Updated Nov 21, 2017
    Cite
    Wu, Min; Li, Xiaoli; Kwoh, Chee-Keong (2017). Algorithms for detecting protein complexes in PPI networks: an evaluation study [Dataset]. http://doi.org/10.4225/03/5a137247533fb
    Dataset updated
    Nov 21, 2017
    Dataset provided by
    Monash University
    Authors
    Wu, Min; Li, Xiaoli; Kwoh, Chee-Keong
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    Since protein complexes play important biological roles in cells, many computational methods have been proposed to detect protein complexes from protein-protein interaction (PPI) data. In this paper, we first review four reputed protein-complex detection algorithms (MCODE[2], MCL[21], CPA[1] and DECAFF[14]) and then present a comprehensive evaluation of them on two popular yeast PPI data sets. We also discuss their relative strengths and disadvantages to guide interested researchers. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  19. PRODORIC

    • scicrunch.org
    • rrid.site
    • +2more
    Updated Jun 11, 2014
    Cite
    (2014). PRODORIC [Dataset]. http://identifiers.org/RRID:SCR_007074
    Dataset updated
    Jun 11, 2014
    Description

    Database about gene regulation and gene expression in prokaryotes. It includes a manually curated and unique collection of transcription factor binding sites. A variety of bioinformatics tools for the prediction, analysis and visualization of regulons and gene regulatory networks is included. The integrated approach provides information about molecular networks in prokaryotes with a focus on pathogenic organisms. In detail this concerns:

    • transcriptional regulation (transcription factors and their DNA binding sites)
    • signal transduction (two-component systems, phosphorylation cascades)
    • protein interactions (complex formation, oligomerization)
    • biochemical pathways (chemical reactions)
    • other regulation events (e.g. codon usage)

    It aims to be a resource to model protein-host interactions and to be a suitable platform to analyze high-throughput data from proteomics and transcriptomics experiments (systems biology). Currently it mainly contains detailed information about operon and promoter structures, including large collections of transcription factor binding sites. If an appropriate number of regulatory binding sites is available, a position weight matrix (PWM) and a sequence logo are provided, which can be used to predict new binding sites. This data is collected manually by screening the original scientific literature. PRODORIC also handles protein-protein interactions and signal-transduction cascades that commonly occur in the form of two-component systems in prokaryotes. Furthermore, it contains metabolic network data imported from the KEGG database.

  20. PDBe - Protein Data Bank in Europe

    • rrid.site
    • scicrunch.org
    • +1more
    Updated Mar 22, 2025
    + more versions
    Cite
    (2025). PDBe - Protein Data Bank in Europe [Dataset]. http://identifiers.org/RRID:SCR_004312
    Dataset updated
    Mar 22, 2025
    Description

    The European resource for the collection, organization and dissemination of data on biological macromolecular structures. In collaboration with the other worldwide Protein Data Bank (wwPDB) partners - the Research Collaboratory for Structural Bioinformatics (RCSB) and BioMagResBank (BMRB) in the USA and the Protein Data Bank of Japan (PDBj) - they work to collate, maintain and provide access to the global repository of macromolecular structure data. The main objectives of the work at PDBe are:

    • to provide an integrated resource of high-quality macromolecular structures and related data and make it available to the biomedical community via intuitive user interfaces.
    • to maintain in-house expertise in all the major structure-determination techniques (X-ray, NMR and EM) in order to stay abreast of technical and methodological developments in these fields, and to work with the community on issues of mutual interest (such as data representation, harvesting, formats and standards, or validation of structural data).
    • to provide high-quality deposition and annotation facilities for structural data as one of the wwPDB deposition sites.

    Several sophisticated tools are also available for the structural analysis of macromolecules.

Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204

Bioinformatics Protein Dataset - Simulated


Dataset updated
Dec 27, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Rafael Gallo
License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

  • ID_Protein: Unique identifier for each protein.
  • Sequence: String of amino acids.
  • Molecular_Weight: Molecular weight calculated from the sequence.
  • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
  • Hydrophobicity: Average hydrophobicity calculated from the sequence.
  • Total_Charge: Sum of the charges of the amino acids in the sequence.
  • Polar_Proportion: Percentage of polar amino acids in the sequence.
  • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
  • Sequence_Length: Total number of amino acids in the sequence.
  • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

  • UniProt: A comprehensive database of protein sequences and annotations.
  • Kyte-Doolittle Scale: Calculations of hydrophobicity.
  • Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
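The three steps above can be sketched in plain Python. This is an illustrative reconstruction, not the author's script: it assumes residues are drawn uniformly from the 20 standard amino acids, it computes only the mean hydrophobicity using a hand-coded Kyte-Doolittle table instead of Biopython, and the `P00000`-style identifier is a hypothetical ID scheme.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def random_protein(rng: random.Random, min_len: int = 50, max_len: int = 300) -> str:
    """Step 1: random chain with a length between min_len and max_len residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence: str) -> float:
    """Step 2 (one property only): average Kyte-Doolittle hydropathy."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng: random.Random, index: int) -> dict:
    """Build one row; step 3 assigns the class at random."""
    sequence = random_protein(rng)
    return {
        "ID_Protein": f"P{index:05d}",  # hypothetical ID scheme
        "Sequence": sequence,
        "Hydrophobicity": round(mean_hydrophobicity(sequence), 3),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for record in (make_record(rng, i) for i in range(3)):
    print(record["ID_Protein"], record["Sequence_Length"],
          record["Hydrophobicity"], record["Class"])
```

Because the class label is drawn independently of the sequence, no feature computed from the sequence carries information about it in this sketch.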

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).
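A small loader can check that a split file carries the ten columns listed under "Columns Included" before any modelling starts. This is a sketch under assumptions: the files are taken to be comma-separated with a header row whose names match the column list exactly, which the description does not confirm, and `read_split` is a hypothetical helper name.

```python
import csv
import io
from typing import Dict, List, TextIO

EXPECTED_COLUMNS = [
    "ID_Protein", "Sequence", "Molecular_Weight", "Isoelectric_Point",
    "Hydrophobicity", "Total_Charge", "Polar_Proportion",
    "Nonpolar_Proportion", "Sequence_Length", "Class",
]


def read_split(handle: TextIO) -> List[Dict[str, str]]:
    """Read one split and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in for proteinas_train.csv (values are made up).
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "P1,ACDE,0,0,0,0,0,0,4,Enzyme\n"
)
rows = read_split(sample)
print(len(rows), rows[0]["Class"])  # 1 Enzyme
```

With the real files, the same function would be called as `read_split(open("proteinas_train.csv", newline=""))`, and the row counts could then be compared against the stated 16,000 and 4,000 samples.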

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
