100+ datasets found
  1. Bioinformatics for Researchers in Life Sciences: Tools and Learning...

    • data.iadb.org
    csv, pdf
    Updated Apr 10, 2025
    Cite
    IDB Datasets (2025). Bioinformatics for Researchers in Life Sciences: Tools and Learning Resources [Dataset]. http://doi.org/10.60966/kwvb-wr19
    Available download formats: csv(355108), pdf(2989058), csv(276253)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    IDB Datasets
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2020 - Jan 1, 2021
    Description

    The COVID-19 pandemic has shown that bioinformatics--a multidisciplinary field that combines biological knowledge with computer programming concerned with the acquisition, storage, analysis, and dissemination of biological data--has a fundamental role in scientific research strategies in all disciplines involved in fighting the virus and its variants. It aids in sequencing and annotating genomes and their observed mutations; analyzing gene and protein expression; simulation and modeling of DNA, RNA, proteins and biomolecular interactions; and mining of biological literature, among many other critical areas of research. Studies suggest that bioinformatics skills in the Latin American and Caribbean region are relatively incipient, and thus its scientific systems cannot take full advantage of the increasing availability of bioinformatic tools and data. This dataset is a catalog of bioinformatics software for researchers and professionals working in life sciences. It includes more than 300 different tools for varied uses, such as data analysis, visualization, repositories and databases, data storage services, scientific communication, marketplace and collaboration, and lab resource management. Most tools are available as web-based or desktop applications, while others are programming libraries. It also includes 10 suggested entries for other third-party repositories that could be of use.

  2. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip(12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
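
The three steps above can be sketched in a few lines. This is a minimal illustration, not the author's generation script: plain Python stands in for Biopython, only the Kyte-Doolittle average hydrophobicity is computed, and the identifier format (`P00000`) and the function name `make_protein` are assumptions made for the example.

```python
import random

# Kyte-Doolittle hydropathy value for each of the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def make_protein(rng, protein_id):
    """Build one synthetic record (hypothetical helper, not from the dataset)."""
    # Step 1: random chain with a length between 50 and 300 residues (inclusive).
    length = rng.randint(50, 300)
    seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: one derived property, the mean Kyte-Doolittle hydropathy.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in seq) / length
    # Step 3: class label drawn at random, independent of the sequence.
    return {
        "ID_Protein": protein_id,
        "Sequence": seq,
        "Hydrophobicity": hydrophobicity,
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_protein(rng, f"P{i:05d}") for i in range(5)]
```

Because the class is drawn independently of the sequence, a classifier trained on such records should not be expected to beat chance, which is consistent with the limitations listed for this dataset.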

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  3. Bioinformatics infrastructure and training summary

    • search.datacite.org
    Updated Jan 20, 2016
    Cite
    Nicholas Loman; Thomas Connor (2016). Bioinformatics infrastructure and training summary [Dataset]. http://doi.org/10.6084/m9.figshare.1572287.v1
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    DataCite: https://www.datacite.org/
    Figshare: http://figshare.com/
    Authors
    Nicholas Loman; Thomas Connor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We undertook a poll of bioinformaticians, marketed through Twitter, in order to understand more about the current issues with bioinformatics practice and training. Methods: Using a public Google Form, we asked questions relating to frustrations, working practices, and limitations of working practices. We also assessed whether the survey participant was UK based and what level of self-declared skill they had. Users had the opportunity to read the other responses to the survey, and edit or delete their answers. Results: This fileset presents the form, the responses (in Excel and CSV format) and the summary responses. The results may be of use for those wishing to understand more about the current issues facing bioinformaticians and bioinformatics training. The results are distributed under the CC-BY license. We are grateful to all participants who took the time to fill out this survey.

  4. Novel classification scheme for temporal genomic and proteomic problems

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Simon Kocbek; Gregor Stiglic; Mateja Verlic; Peter Kokol (2022). Novel classification scheme for temporal genomic and proteomic problems [Dataset]. http://doi.org/10.4225/03/5a1373171c74f
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Simon Kocbek; Gregor Stiglic; Mateja Verlic; Peter Kokol
    Description

    For over a decade, genomic and proteomic datasets have presented a challenge for various statistical and machine learning methods. Most microarray or mass spectrometry based datasets consist of a small number of samples with a large number of gene or protein expression measurements, but in the past few years new types of datasets with an additional time component have become available. These datasets offer new opportunities for the development of new classification and gene selection techniques, where one of the problems is the reduction of high dimensionality. This paper presents a novel classification technique which combines feature extraction and feature selection to obtain the optimal set of genes available to a classifier. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  5. results.tar.gz

    • figshare.com
    application/x-gzip
    Updated Sep 2, 2024
    Cite
    Francesco Costa (2024). results.tar.gz [Dataset]. http://doi.org/10.6084/m9.figshare.26880556.v1
    Available download formats: application/x-gzip
    Dataset updated
    Sep 2, 2024
    Dataset provided by
    figshare
    Figshare: http://figshare.com/
    Authors
    Francesco Costa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results from "Keeping it in the family: Using protein family templates to rescue low confidence AlphaFold2 models"

  6. Data_Sheet_8_Diatom DNA Metabarcoding for Biomonitoring: Strategies to Avoid...

    • datasetcatalog.nlm.nih.gov
    Updated Oct 30, 2019
    + more versions
    Cite
    Tapolczai, Kálmán; Kahlert, Maria; Rimet, Frédéric; Bouchez, Agnès; Keck, François; Vasselon, Valentin (2019). Data_Sheet_8_Diatom DNA Metabarcoding for Biomonitoring: Strategies to Avoid Major Taxonomical and Bioinformatical Biases Limiting Molecular Indices Capacities.CSV [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000060725
    Dataset updated
    Oct 30, 2019
    Authors
    Tapolczai, Kálmán; Kahlert, Maria; Rimet, Frédéric; Bouchez, Agnès; Keck, François; Vasselon, Valentin
    Description

    Recent years provided intense progression in the implementation of molecular techniques in a wide variety of research fields in ecology. Biomonitoring and bioassessment can greatly benefit from DNA metabarcoding and High-Throughput Sequencing (HTS) methods that potentially provide reliable, high quantity and quality standardized data in a cost- and time-efficient way. However, DNA metabarcoding has its drawbacks, introducing biases at all the steps of the process, particularly during bioinformatics treatments used to prepare HTS data for ecological analyses. The high diversity of bioinformatics methods (e.g., OTU clustering, chimera detection, taxonomic assignment) and parameters (e.g., percentage similarity threshold used to define OTUs) make inter-studies comparison difficult, limiting the development of standardized and easy-accessible bioassessment procedures for routine freshwater monitoring. In order to study and overcome these drawbacks, we constructed four de novo indices to assess river ecological status based on the same biological samples of diatoms analyzed with morphological and molecular methods. The biological inventories produced are (i) morphospecies identified by microscopy, (ii) OTUs provided via metabarcoding and hierarchical clustering of sequences using a 95% similarity threshold, (iii) individual sequence units (ISUs) via metabarcoding and only minimal bioinformatical quality filtering, and (iv) exact sequence variants (ESVs) using DADA2 denoising algorithm. The indices based on molecular data operated directly with ecological values estimated for OTUs/ ISUs/ ESVs. Our study used an approach of bypassing taxonomic assignment, so bias related to unclassified sequences missing from reference libraries could be handled and no information on ecology of sequences is lost. 
Additionally, we showed that the indices based on ISUs and ESVs were equivalent, outperforming the OTU-based one in terms of predictive power and accuracy by revealing the hidden ecological information of sequences that are otherwise clustered in the same OTU (intra-species/intra-population variability). Furthermore, ISUs, ESVs, and morphospecies indices provided similar estimation of site ecological status, validating that ISUs with limited bioinformatics treatments may be used for DNA freshwater monitoring. Our study is a proof of concept in which a taxonomy- and clustering-free approach is presented, which we believe is a step toward a standardized and comparable DNA bioassessment, complementary to morphological methods.

  7. Replication data for: Nearest Neighbor Networks: clustering expression data...

    • dataverse.harvard.edu
    Updated Feb 8, 2010
    Cite
    C. Huttenhower; A. Flamholz; J. Landis; S. Sahi; C. Myers; K. Olszewski; M. Hibbs; N. Siemers; O. Troyanskaya; H. Coller (2010). Replication data for: Nearest Neighbor Networks: clustering expression data based on gene neighborhoods [Dataset]. http://doi.org/10.7910/DVN/PO4EWY
    Dataset updated
    Feb 8, 2010
    Dataset provided by
    Harvard Dataverse
    Authors
    C. Huttenhower; A. Flamholz; J. Landis; S. Sahi; C. Myers; K. Olszewski; M. Hibbs; N. Siemers; O. Troyanskaya; H. Coller
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Background: The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes). Results: We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods. Conclusion: The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
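
The mutual-nearest-neighbor idea in this description can be illustrated with a short sketch. It covers only the construction of the mutual-neighbor edges, not the overlapping-clique step, and it is not the authors' NNN implementation: the Euclidean distance, the function name, and the toy expression profiles are all assumptions made for the example.

```python
import math


def mutual_nearest_neighbors(profiles, k):
    """Return gene pairs that appear in each other's k-nearest-neighbor lists."""
    names = list(profiles)

    def dist(a, b):
        # Euclidean distance between two expression profiles (an assumption;
        # a correlation-based measure could be substituted).
        return math.dist(profiles[a], profiles[b])

    # k nearest neighbors of every gene, excluding the gene itself.
    neighbors = {}
    for g in names:
        others = sorted((h for h in names if h != g), key=lambda h: dist(g, h))
        neighbors[g] = set(others[:k])

    # Keep an edge only when the neighbor relation holds in both directions.
    return {
        tuple(sorted((g, h)))
        for g in names
        for h in neighbors[g]
        if g in neighbors[h]
    }


# Toy profiles (hypothetical values): two tight pairs and one outlier.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [9.0, 1.0, 0.0],
    "geneD": [9.2, 0.8, 0.1],
    "geneE": [5.0, 5.0, 5.0],
}
edges = mutual_nearest_neighbors(profiles, k=1)
```

With `k=1`, `geneE` picks `geneB` as its nearest neighbor, but `geneB` picks `geneA`, so `geneE` ends up with no edge. That is the behavior the description refers to when it says genes without sufficiently similar partners remain unclustered.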

  8. Accurate and fast feature selection workflow for high-dimensional omics data...

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Yasset Perez-Riverol; Max Kuhn; Juan Antonio Vizcaíno; Marc-Phillip Hitz; Enrique Audain (2023). Accurate and fast feature selection workflow for high-dimensional omics data [Dataset]. http://doi.org/10.1371/journal.pone.0189875
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Yasset Perez-Riverol; Max Kuhn; Juan Antonio Vizcaíno; Marc-Phillip Hitz; Enrique Audain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We are moving into the age of ‘Big Data’ in biomedical research and bioinformatics. This trend could be encapsulated in this simple formula: D = S * F, where the volume of data generated (D) increases in both dimensions: the number of samples (S) and the number of sample features (F). Frequently, a typical omics classification includes redundant and irrelevant features (e.g. genes or proteins) that can result in long computation times, decreased model performance, and the selection of suboptimal features (genes and proteins) after the classification/regression step. Multiple algorithms and reviews have been published to describe the existing methods for feature selection (FS), their strengths and weaknesses. However, the selection of the correct FS algorithm and strategy constitutes an enormous challenge. Despite the number and diversity of algorithms available, the proper choice of an approach for facing a specific problem often falls in a ‘grey zone’. In this study, we select a subset of FS methods to develop an efficient workflow and an R package for bioinformatics machine learning problems. We cover relevant issues concerning FS, ranging from domain’s problems to algorithm solutions and computational tools. Finally, we use seven different proteomics and gene expression datasets to evaluate the workflow and guide the FS process.

  9. Data from: Particle swarm optimization for MEG source localization

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Chenwei Jiang; Bin Wang; Liming Zhang (2022). Particle swarm optimization for MEG source localization [Dataset]. http://doi.org/10.4225/03/5a1371e5b5842
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Chenwei Jiang; Bin Wang; Liming Zhang
    Description

    The estimation of three-dimensional neural active sources from the magnetoencephalography (MEG) record is a critical issue for both clinical neurology and brain function research. Nowadays the multiple signal classification (MUSIC) algorithm and the recursive MUSIC algorithm are widely used to locate dipolar sources from MEG data. The drawback of these algorithms is that they need excessive calculation and are quite time-consuming when scanning a three-dimensional space. In order to solve this problem, we propose a MEG source localization scheme based on an improved Particle Swarm Optimization (PSO). This scheme uses the global searching ability of PSO to estimate the rough source location. Then, combined with a grid search in a small area, accurate dipolar source localization is performed. In addition, we compare the results of our method with those based on a Genetic Algorithm (GA). Computer simulation results show that our PSO strategy is an effective and precise approach to dipole localization which can improve the speed greatly and localize the sources accurately. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
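
For readers unfamiliar with PSO, the global-search stage can be sketched as follows. This is textbook PSO, not the improved variant or the grid-search refinement the paper proposes, and the objective is a stand-in: a squared distance to a hypothetical source position replaces any real MEG forward model. The parameter values (inertia 0.7, acceleration coefficients 1.5) are common defaults chosen for the example.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=200, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize `objective` over a box with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)  # stay in the box
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


TRUE_SOURCE = (1.0, -2.0, 0.5)  # hypothetical dipole location, for illustration only


def misfit(p):
    """Stand-in objective: squared distance to the hypothetical source."""
    return sum((a - b) ** 2 for a, b in zip(p, TRUE_SOURCE))


estimate, error = pso_minimize(misfit, [(-5.0, 5.0)] * 3)
```

In the scheme described above, the rough location found this way would then be refined by a local grid search; that second stage is not shown here.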

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  10. Sequence Similarity: An inquiry based and "under the hood" approach for...

    • qubeshub.org
    Updated Aug 28, 2021
    Cite
    Adam Kleinschmit*; Benita Brink; Steven Roof; Carlos Goller; Sabrina Robertson (2021). Sequence Similarity: An inquiry based and "under the hood" approach for incorporating molecular sequence alignment in introductory undergraduate biology courses [Dataset]. http://doi.org/10.24918/cs.2019.5
    Dataset updated
    Aug 28, 2021
    Dataset provided by
    QUBES
    Authors
    Adam Kleinschmit*; Benita Brink; Steven Roof; Carlos Goller; Sabrina Robertson
    Description

    Introductory bioinformatics exercises often walk students through the use of computational tools, but often provide little understanding of what a computational tool does "under the hood." A solid understanding of how a bioinformatics computational algorithm functions, including its limitations, is key for interpreting the output in a biologically relevant context. This introductory bioinformatics exercise integrates an introduction to web-based sequence alignment algorithms with models to facilitate student reflection and appreciation for how computational tools provide similarity output data. The exercise concludes with a set of inquiry-based questions in which students may apply computational tools to solve a real biological problem.

    In the module, students first define sequence similarity and then investigate how similarity can be quantitatively compared between two similar length proteins using a Blocks Substitution Matrix (BLOSUM) scoring matrix. Students then look for local regions of similarity between a sequence query and subjects within a large database using Basic Local Alignment Search Tool (BLAST). Lastly, students access text-based FASTA-formatted sequence information via National Center for Biotechnology Information (NCBI) databases as they collect sequences for a multiple sequence alignment using Clustal Omega to generate a phylogram and evaluate evolutionary relationships. The combination of diverse, inquiry-based questions, paper models, and web-based computational resources provides students with a solid basis for more advanced bioinformatics topics and an appreciation for the importance of bioinformatics tools across the discipline of biology.
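
The first activity in the module, scoring two equal-length sequences with a substitution matrix, can be sketched as below. This is an illustration rather than material from the module: the score table is a four-residue excerpt whose values are intended to follow BLOSUM62 and should be checked against the published matrix before reuse, and the function names are assumptions.

```python
# Excerpt of a substitution matrix for A, I, L and V only.
# Values are meant to follow BLOSUM62; verify against the published matrix.
SCORES = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
}


def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def ungapped_score(seq1, seq2):
    """Sum the per-position scores of two equal-length sequences (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length for ungapped scoring")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


score = ungapped_score("ALIVA", "AVLIA")   # 4 + 1 + 2 + 3 + 4
identity = ungapped_score("ALIV", "ALIV")  # 4 + 4 + 4 + 4
```

Conservative substitutions such as I/V still contribute positive scores, which is why a similarity score carries more information than percent identity alone.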

  11. Data from: Algorithms for detecting protein complexes in PPI networks: an...

    • researchdata.edu.au
    • bridges.monash.edu
    • +1more
    Updated May 5, 2022
    Cite
    Min Wu; Xiaoli Li; Chee-Keong Kwoh (2022). Algorithms for detecting protein complexes in PPI networks: an evaluation study [Dataset]. http://doi.org/10.4225/03/5a137247533fb
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Min Wu; Xiaoli Li; Chee-Keong Kwoh
    Description

    Since protein complexes play important biological roles in cells, many computational methods have been proposed to detect protein complexes from protein-protein interaction (PPI) data. In this paper, we first review four reputed protein-complex detection algorithms (MCODE[2], MCL[21], CPA[1] and DECAFF[14]) and then present a comprehensive evaluation of them on two popular yeast PPI datasets. We also discuss their relative strengths and disadvantages to guide interested researchers. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  12. Table_1_The Development of a Sustainable Bioinformatics Training Environment...

    • datasetcatalog.nlm.nih.gov
    Updated Sep 23, 2021
    Cite
    Ras, Verena; Mulder, Nicola; Panji, Sumir; Chauke, Paballo Abel; Johnston, Katherine; Aron, Shaun (2021). Table_1_The Development of a Sustainable Bioinformatics Training Environment Within the H3Africa Bioinformatics Network (H3ABioNet).pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000881136
    Dataset updated
    Sep 23, 2021
    Authors
    Ras, Verena; Mulder, Nicola; Panji, Sumir; Chauke, Paballo Abel; Johnston, Katherine; Aron, Shaun
    Description

    Bioinformatics training programs have been developed independently around the world based on the perceived needs of the local and global academic communities. The field of bioinformatics is complicated by the need to train audiences from diverse backgrounds in a variety of topics to various levels of competencies. While there have been several attempts to develop standardised approaches to provide bioinformatics training globally, the challenges encountered in resource limited settings hinder the adaptation of these global approaches. H3ABioNet, a Pan-African Bioinformatics Network with 27 nodes in 16 African countries, has realised that there is no single simple solution to this challenge and has rather, over the years, evolved and adapted training approaches to create a sustainable training environment, with several components that allow for the successful dissemination of bioinformatics knowledge to diverse audiences. This has been achieved through the implementation of a combination of training modalities and sharing of high quality training material and experiences. The results highlight the success of implementing this multi-pronged approach to training, to reach audiences from different backgrounds and provide training in a variety of different areas of expertise. While face-to-face training was initially required and successful, the mixed-model teaching approach allowed for an increased reach, providing training in advanced analysis topics to reach large audiences across the continent with minimal teaching resources. The transition to hackathons provided an environment to allow the progression of skills, once basic skills had been developed, together with the development of real-world solutions to bioinformatics problems. Ensuring our training materials are FAIR, and through synergistic collaborations with global training partners, the reach of our training materials extends beyond H3ABioNet. 
Coupled with the opportunity to develop additional career building soft skills, such as scientific communication, H3ABioNet has created a flexible, sustainable and high quality bioinformatics training environment that has successfully been implemented to train several highly skilled African bioinformaticians on the continent.

  13. Gene Prediction Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 9, 2025
    Cite
    Data Insights Market (2025). Gene Prediction Software Report [Dataset]. https://www.datainsightsmarket.com/reports/gene-prediction-software-1454510
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The gene prediction software market is experiencing robust growth, driven by the escalating demand for accurate and efficient gene identification in various life science research domains. The market's expansion is fueled by several key factors, including the increasing adoption of next-generation sequencing (NGS) technologies, which generate massive amounts of genomic data requiring sophisticated analysis tools. Furthermore, advancements in algorithms and machine learning are significantly enhancing the accuracy and speed of gene prediction, leading to improved research outcomes and drug discovery efforts. The rising prevalence of genetic disorders and the growing need for personalized medicine are also contributing to the market's upward trajectory. Competitive pressures among key players like Illumina, Qiagen, and BGI Genomics are stimulating innovation and pushing the development of more advanced and user-friendly software solutions. While data limitations prevent precise quantification, a reasonable estimation of the 2025 market size, considering the growth of related fields, could be around $500 million, projecting a CAGR of 15% over the forecast period (2025-2033). This growth is expected to be consistent, though potentially influenced by fluctuations in research funding and technological advancements. The market segmentation is likely diverse, encompassing software solutions tailored to specific organisms (e.g., bacterial, eukaryotic) and functionalities (e.g., gene finding, splice site prediction, promoter identification). Geographical distribution is anticipated to be heavily concentrated in regions with well-established research infrastructure and funding, such as North America and Europe, though growth in Asia-Pacific is expected to be significant. However, challenges such as the high cost of software licenses and the need for specialized expertise to operate these tools may restrain market expansion in certain regions. 
The ongoing development of open-source alternatives and cloud-based solutions could potentially mitigate these constraints, making gene prediction software more accessible and affordable to a wider range of researchers.
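
The growth figures quoted in this description imply a 2033 market size that can be worked out with compound-growth arithmetic. The sketch below assumes eight annual compounding periods (2025 to 2033) and takes the $500 million base and 15% CAGR at face value, although the description itself calls them a rough estimate; the function name is hypothetical.

```python
def project_market_size(base_size, annual_growth, years):
    """Compound a base value at a constant annual growth rate."""
    return base_size * (1.0 + annual_growth) ** years


# Base of 500 (USD millions) in 2025, 15% CAGR, compounded through 2033.
projected_2033 = project_market_size(500.0, 0.15, 2033 - 2025)
```

Under these assumptions the projection comes to roughly $1.53 billion, about three times the 2025 estimate.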

  14. Data sets used in the experiments.

    • plos.figshare.com
    xls
    Updated Nov 29, 2023
    Cite
    Miika Leinonen; Leena Salmela (2023). Data sets used in the experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0294415.t002
    Available download formats: xls
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Miika Leinonen; Leena Salmela
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Often the use of long k-mers is preferred because they can be uniquely associated with a specific genomic region. Unfortunately, it is not possible to reliably extract long k-mers in high error rate reads with standard exact k-mer counting methods. We propose SAKE, a method to extract long k-mers from high error rate reads by utilizing strobemers and consensus k-mer generation through partial order alignment. Our experiments show that on simulated data with up to 6% error rate, SAKE can extract 97-mers with over 90% recall. Conversely, the recall of DSK, an exact k-mer counter, drops to less than 20%. Furthermore, the precision of SAKE remains similar to DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, while the recall of DSK already drops to 50%. We show that SAKE can extract more k-mers from uncorrected high error rate reads compared to exact k-mer counting. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.
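
As a point of reference for the comparison above, exact k-mer counting can be sketched in a few lines. This is a naive in-memory version for illustration only; it is not how DSK or SAKE is implemented, and the reads are made up for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


reads = ["ACGTAC", "CGTACG"]  # toy reads, error-free
counts = count_kmers(reads, k=3)
```

A single sequencing error changes up to k of the k-mers that overlap it, so at high error rates few long k-mers survive intact. That is the reason the description gives for the falling recall of exact counters on uncorrected reads.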

  15. MSA-nuc-9-seq

    • huggingface.co
    Updated Dec 2, 2023
    + more versions
    Cite
    Edo Dotan (2023). MSA-nuc-9-seq [Dataset]. https://huggingface.co/datasets/dotan1111/MSA-nuc-9-seq
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2023
    Authors
    Edo Dotan
    Description

    Multiple Sequence Alignment as a Sequence-to-Sequence Learning Problem

    Abstract:

    The sequence alignment problem is one of the most fundamental problems in bioinformatics and a plethora of methods were devised to tackle it. Here we introduce BetaAlign, a methodology for aligning sequences using an NLP approach. BetaAlign accounts for the possible variability of the evolutionary process among different datasets by using an ensemble of transformers, each trained on millions… See the full description on the dataset page: https://huggingface.co/datasets/dotan1111/MSA-nuc-9-seq.

  16. k-Word matches: an alignment-free sequence comparison method

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Conrad J. Burden; Sylvain Forêt; Susan R. Wilson (2022). k-Word matches: an alignment-free sequence comparison method [Dataset]. http://doi.org/10.4225/03/5a1372cde0ad8
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Conrad J. Burden; Sylvain Forêt; Susan R. Wilson
    Description

    The number of k-word matches, i.e. the number of words of length k shared between two sequences, also known as the D2 statistic, is used as an alignment-free sequence comparison statistic. The advantages of this statistic over alignment-based methods for nucleotide and amino-acid sequence comparisons are, firstly, that it does not assume that homologous segments are contiguous and, secondly, that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. We summarise our results to date on determining the distributional properties of the D2 statistic for a range of biologically relevant parameters and outline the directions in which the research will proceed. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
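
    As a rough illustration of the statistic described above, the following Python sketch computes D2 as the sum, over all k-words, of the product of that word's counts in the two sequences. The function names `kword_counts` and `d2_statistic` are hypothetical and not taken from the dataset; the sketch assumes exact matching only.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count each word of length k occurring in seq (overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Each pair of positions (i, j) whose k-words are identical counts once,
    which equals the sum over words of count_a[word] * count_b[word].
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items() if word in counts_b)
```

    Building both count tables takes one pass over each sequence, which is consistent with the linear runtime mentioned in the description.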

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  17. 1000 Genomes Project and AWS

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Mar 30, 2012
    Cite
    (2012). 1000 Genomes Project and AWS [Dataset]. http://identifiers.org/RRID:SCR_008801
    Dataset updated
    Mar 30, 2012
    Description

    A dataset containing the full genomic sequence of 1,700 individuals, freely available for research use. The 1000 Genomes Project is an international research effort coordinated by a consortium of 75 companies and organizations to establish the most detailed catalogue of human genetic variation. The project has grown to 200 terabytes of genomic data, including DNA sequenced from more than 1,700 individuals, which researchers can now access on AWS free of charge for use in disease research. The dataset is available to all via Amazon S3 at: http://s3.amazonaws.com/1000genomes The 1000 Genomes Project aims to include the genomes of more than 2,662 individuals from 26 populations around the world, and the NIH will continue to add the remaining genome samples to the data collection this year. Public Data Sets on AWS provide a centralized repository of public data hosted on Amazon Simple Storage Service (Amazon S3). The data can be seamlessly accessed from AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR), which provide organizations with the highly scalable compute resources needed to take advantage of these large data collections. AWS is storing the public data sets at no charge to the community. Researchers pay only for the additional AWS resources they need for further processing or analysis of the data. All 200 TB of the latest 1000 Genomes Project data is available in a public Amazon S3 bucket. You can access the data via simple HTTP requests, or take advantage of the AWS SDKs in languages such as Ruby, Java, Python, .NET and PHP. Researchers can use the Amazon EC2 utility computing service to dive into this data without the usual capital investment required to work with data at this scale.

    AWS also provides a number of orchestration and automation services to help teams make their research available to others to remix and reuse. Making the data available via a bucket in Amazon S3 also means that customers can crunch the information using Hadoop via Amazon Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.
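
    The "simple HTTP requests" mentioned above could look roughly like the Python sketch below, which uses only the standard library and the S3 REST listing call (`list-type=2`). It assumes the bucket URL quoted in the description still accepts anonymous listing requests, which may no longer hold; the helper names and the `example/` prefix are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# URL quoted in the description; continued anonymous access is an assumption.
BUCKET_URL = "http://s3.amazonaws.com/1000genomes"
# XML namespace used by S3 bucket-listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(prefix="", max_keys=10):
    """Build an S3 ListObjectsV2 URL for keys starting with the given prefix."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_text):
    """Extract (key, size_in_bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def fetch_listing(prefix="", max_keys=10):
    """Request one page of the bucket listing over plain HTTP and parse it."""
    with urllib.request.urlopen(listing_url(prefix, max_keys), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

    Calling `fetch_listing("example/", 5)` would return up to five `(key, size)` pairs if the request succeeds. For anything beyond browsing, one of the AWS SDKs named in the description is likely the more robust route.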

  18. Single Molecule Real Time Sequencing Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 3, 2025
    Cite
    Data Insights Market (2025). Single Molecule Real Time Sequencing Report [Dataset]. https://www.datainsightsmarket.com/reports/single-molecule-real-time-sequencing-1991650
    Available download formats
    doc, ppt, pdf
    Dataset updated
    Nov 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Single Molecule Real-Time (SMRT) Sequencing market is poised for significant expansion, projected to reach an estimated market size of approximately $1.8 billion by 2025, with a robust Compound Annual Growth Rate (CAGR) of around 22% anticipated to sustain this momentum through 2033. This upward trajectory is primarily fueled by the increasing adoption of SMRT sequencing in academic and research institutes, driven by its unparalleled accuracy and ability to detect epigenetic modifications and structural variations that are crucial for groundbreaking biological discoveries. The pharmaceutical and biotechnology sectors are also heavily invested, leveraging SMRT sequencing for advanced drug discovery, development, and personalized medicine initiatives, particularly in areas like oncology and infectious diseases. Furthermore, hospitals and clinics are increasingly integrating SMRT sequencing into diagnostic workflows, especially for complex genetic disorders and pathogen identification, underscoring its growing clinical utility. The market's growth is further bolstered by continuous technological advancements that enhance sequencing speed, reduce costs, and improve data analysis capabilities. Trends such as the rise of long-read sequencing for comprehensive genome assembly, the integration of SMRT sequencing with other omics technologies, and the development of sophisticated bioinformatics tools are collectively driving deeper insights into complex biological systems. However, the market faces certain restraints, including the initial capital investment required for SMRT sequencing platforms and the need for specialized bioinformatics expertise for data interpretation. Despite these challenges, the inherent advantages of SMRT sequencing in providing high-resolution, error-free data are expected to outweigh these limitations, driving widespread adoption across diverse applications and ensuring sustained market vitality in the coming years.

  19. Data from: The strong convergence of visual classification method and its...

    • bridges.monash.edu
    pdf
    Updated Nov 21, 2017
    Cite
    Meng, Deyu; Xu, Zongben; Leung, Yee; Fung, Tung (2017). The strong convergence of visual classification method and its applications in disease diagnosis [Dataset]. http://doi.org/10.4225/03/5a1371f709257
    Available download formats
    pdf
    Dataset updated
    Nov 21, 2017
    Dataset provided by
    Monash University
    Authors
    Meng, Deyu; Xu, Zongben; Leung, Yee; Fung, Tung
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    The visual classification method is introduced as a learning strategy for pattern classification problems in bioinformatics. In this paper, we show the strong convergence property of the proposed method. In particular, the method is shown to converge to the Bayes estimator, i.e., the learning error of the method tends to the minimal posterior expected value. The method is successfully applied to some practical disease diagnosis problems. The experimental results all verify the validity and effectiveness of the theoretical conclusions. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  20. Data from: Chromosome assembly of large and complex genomes using multiple...

    • ckan.earlham.ac.uk
    Updated Aug 1, 2019
    Cite
    ckan.earlham.ac.uk (2019). Chromosome assembly of large and complex genomes using multiple references [Dataset]. https://ckan.earlham.ac.uk/dataset/446e8321-b6ff-4583-a53f-311c2b4bf5c2
    Dataset updated
    Aug 1, 2019
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the rapid development of sequencing technologies, the assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout 2, a reference-assisted assembly tool that works for large and complex genomes. By taking one or more target assemblies (generated from an NGS assembler) and one or multiple related reference genomes, Ragout 2 infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. By using Ragout 2, we transformed NGS assemblies of 16 laboratory mouse strains into sets of complete chromosomes, leaving <5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realigning of long Pacific Biosciences (PacBio) reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. We applied Ragout 2 to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared with other genomes from the Muridae family. Chromosome painting maps confirmed most large-scale rearrangements that Ragout 2 detected. We applied Ragout 2 to improve draft sequences of three ape genomes that have recently been published. Ragout 2 transformed three sets of contigs (generated using PacBio reads only) into chromosome-scale assemblies with accuracy comparable to chromosome assemblies generated in the original study using BioNano maps, Hi-C, BAC clones, and FISH.

