100+ datasets found
  1. Bioinformatics: Technologies and Global Markets

    • bccresearch.com
    Updated Dec 7, 2023
    Cite
    BCC Publishing (2023). Bioinformatics: Technologies and Global Markets [Dataset]. https://www.bccresearch.com/market-research/biotechnology/page3
    Dataset updated
    Dec 7, 2023
    Dataset authored and provided by
    BCC Publishing
    License

    https://www.bccresearch.com/aboutus/terms-conditions

    Description

    Explore BCC Research's comprehensive report on the bioinformatics technologies market. The report studies current and historical market revenues, estimated by services and platforms, solutions, and application type.

  2. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
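
    The three creation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's generation script: it assumes uniform random sampling over the 20 standard amino acids, computes only the average Kyte-Doolittle hydrophobicity from a hand-written table instead of calling Biopython, and uses a made-up `P00000`-style identifier. The function name `make_protein` is hypothetical.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def make_protein(rng, protein_id):
    """Build one synthetic record (hypothetical helper, not the author's code)."""
    # Step 1: random chain of 50 to 300 residues (uniform sampling is assumed).
    length = rng.randint(50, 300)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: one example property, the mean Kyte-Doolittle hydropathy.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    # Step 3: the class label is drawn at random, independent of the sequence.
    return {
        "ID_Protein": protein_id,
        "Sequence": sequence,
        "Hydrophobicity": round(hydrophobicity, 3),
        "Sequence_Length": length,
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
proteins = [make_protein(rng, f"P{i:05d}") for i in range(5)]
for p in proteins:
    print(p["ID_Protein"], p["Sequence_Length"], p["Hydrophobicity"], p["Class"])
```

    Because the class is drawn independently of the sequence, a classifier trained on such data should not be expected to beat chance, which is consistent with the limitations the dataset description lists.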

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  3. Introduction to bioinformatics

    • workwithdata.com
    Updated Nov 26, 2024
    Cite
    Work With Data (2024). Introduction to bioinformatics [Dataset]. https://www.workwithdata.com/object/introduction-to-bioinformatics-book-by-arthur-m-lesk-0000
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction to bioinformatics is a book. It was written by Arthur M. Lesk and published by Oxford University Press in 2002.

  4. Table_6_Identification of Potential Molecular Mechanism Related to Infertile...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 8, 2023
    + more versions
    Cite
    Xiushen Li; Li Guo; Weiwen Zhang; Junli He; Lisha Ai; Chengwei Yu; Hao Wang; Weizheng Liang (2023). Table_6_Identification of Potential Molecular Mechanism Related to Infertile Endometriosis.XLSX [Dataset]. http://doi.org/10.3389/fvets.2022.845709.s011
    Available download formats: xlsx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiushen Li; Li Guo; Weiwen Zhang; Junli He; Lisha Ai; Chengwei Yu; Hao Wang; Weizheng Liang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objectives: In this research, we aim to explore the bioinformatic mechanism of infertile endometriosis in order to identify new treatment targets and molecular mechanisms.

    Methods: The Gene Expression Omnibus (GEO) database was used to download mRNA sequencing data from infertile endometriosis patients. The “limma” package in R was used to find differentially expressed genes (DEGs). Weighted gene co-expression network analysis (WGCNA) was used to classify genes into modules and to obtain the correlation coefficient between the modules and infertile endometriosis. The intersection of the most disease-related module genes and the DEGs is called gene set 1. To clarify the molecular mechanisms and potential therapeutic targets for infertile endometriosis, we used Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment, Protein-Protein Interaction (PPI) networks, and Gene Set Enrichment Analysis (GSEA) on these intersecting genes. We identified lncRNAs and miRNAs linked with infertility and created competing endogenous RNA (ceRNA) regulation networks using the Human MicroRNA Disease Database (HMDD), the miRTarBase database, and the LncRNA Disease database.

    Results: First, WGCNA enrichment analysis was used to examine the infertile endometriosis dataset GSE120103, and we discovered that the Meorangered1 module was the most significantly related to infertile endometriosis. According to KEGG enrichment analysis, the intersection genes were mostly enriched in the metabolism of different amino acids, the cGMP-PKG signaling pathway, and the cAMP signaling pathway. The Meorangered1 module genes and DEGs were then subjected to bioinformatic analysis. KEGG enrichment analysis was performed on the hub genes in the PPI network, and the results were consistent with the intersection gene analysis. Finally, we used the database to identify 13 miRNAs and two lncRNAs linked to infertility in order to create the ceRNA regulatory network linked to infertile endometriosis.

    Conclusion: In this study, we used a bioinformatics approach for the first time to identify amino acid metabolism as a possible major cause of infertility in patients with endometriosis and to provide potential targets for the diagnosis and treatment of these patients.

  5. Bioinformatics and Systems Biology

    • data.wu.ac.at
    • datadiscoverystudio.org
    Updated Mar 8, 2017
    Cite
    Federal Laboratory Consortium (2017). Bioinformatics and Systems Biology [Dataset]. https://data.wu.ac.at/schema/data_gov/NWQzYzc3OWQtMTM2Zi00MDI0LTg2ZDMtOTZiOWQzMzIwNjcy
    Dataset updated
    Mar 8, 2017
    Dataset provided by
    Federal Laboratory Consortium
    Description

    The Bioinformatics and Systems Biology (BISB) Core aims to assist investigators in overcoming the technical challenges in utilizing bioinformatics and systems biology techniques. The core will collaborate with principal investigators to incorporate systems biology approaches synergistically into their laboratory studies in order to speed the tempo of their research and develop transformative and translational results.

  6. Raw motif mapping bedfile data and model training set class probabilities

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Nov 30, 2023
    Cite
    Phillip Davis (2023). Raw motif mapping bedfile data and model training set class probabilities [Dataset]. http://doi.org/10.5061/dryad.tdz08kq3w
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Phillip Davis
    Time period covered
    Jan 1, 2023
    Description

    Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...

  7. Data from: Feature ranking and feature redundancy reduction for prognostic...

    • researchdata.edu.au
    Updated May 5, 2022
    Cite
    Qihua Tan; Mads Thomassen; Kaare Christensen; Torben A. Kruse (2022). Feature ranking and feature redundancy reduction for prognostic microarray study of tumor clinical outcomes [Dataset]. http://doi.org/10.4225/03/5a1372383442b
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Qihua Tan; Mads Thomassen; Kaare Christensen; Torben A. Kruse
    Description

    Unlike significant gene expression analysis, which looks for all genes that are differentially regulated, feature selection in prognostic gene expression analysis aims at finding a subset of informative marker genes that are discriminative for prediction. Unfortunately, feature selection in the microarray literature is dominated by the simple heuristic univariate gene filter paradigm, which selects differentially expressed genes according to their statistical significance. Since the univariate approach does not take into account the correlated or interactive structure among the genes, classifiers built on genes so selected can be less accurate. More advanced approaches based on multivariate models have to be considered. Here, we introduce a feature ranking method through forward orthogonal search to assist prognostic gene selection. Application to published gene lists selected by univariate models shows that the feature space can be largely reduced while achieving improved testing performance. Our results indicate that "significant" features selected using the gene-wise approaches can contain irrelevant genes that only serve to complicate model building. Multivariate feature ranking can help to reduce feature redundancy and to select highly informative prognostic marker genes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  8. Data from: CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research...

    • search.datacite.org
    • data.mendeley.com
    • +1more
    Updated Dec 4, 2018
    Cite
    Stian Soiland-Reyes (2018). CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object) [Dataset]. http://doi.org/10.17632/xnwncxpw42
    Dataset updated
    Dec 4, 2018
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Mendeley
    Authors
    Stian Soiland-Reyes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. The workflow has five steps in total:

    1. Read alignment using STAR, which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
    2. The Genome BAM file is processed using Picard MarkDuplicates, producing an updated BAM file containing information on duplicate reads (such reads can indicate biased interpretation).
    3. SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
    4. The indexed BAM file is processed further with RNA-SeQC, which takes the BAM file, human genome reference sequence and Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
    5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool, and additional RSEM reference sequences.

    For testing and analysis, the workflow author provided example data created by down-sampling the read files of a TOPMed public access dataset. Chromosome 12 was extracted from the Homo sapiens assembly 38 reference sequence and provided by the workflow authors. The required GTF and RSEM reference data files are also provided. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, use of containerization for underlying software and detailed documentation are important factors in choosing this specific CWL workflow for CWLProv evaluation. This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl
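
    The dependencies among the five workflow steps described above can be written down as a small graph. The sketch below is only an illustration of that structure in Python (it is not CWL and is not part of the Research Object); the step labels are informal names chosen here, and the standard-library `graphlib` module is used to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
# Labels are informal names for the five steps in the description.
DEPENDENCIES = {
    "STAR alignment": set(),
    "Picard MarkDuplicates": {"STAR alignment"},
    "SAMtools index": {"Picard MarkDuplicates"},
    "RNA-SeQC": {"SAMtools index"},
    "RSEM": {"STAR alignment"},  # depends only on the STAR output
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
for position, step in enumerate(order, start=1):
    print(position, step)
```

    Because RSEM depends only on the alignment, a scheduler is free to run it alongside the MarkDuplicates, index and RNA-SeQC chain, which is the parallelism the description mentions.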

  9. Dataset supporting the tool 'delfies: a Python package for the detection of...

    • zenodo.org
    application/gzip
    Updated Nov 13, 2024
    Cite
    Brice Letcher (2024). Dataset supporting the tool 'delfies: a Python package for the detection of DNA breakpoints with neo-telomere addition' [Dataset]. http://doi.org/10.5281/zenodo.14101798
    Available download formats: application/gzip
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brice Letcher
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose


    These data can be used to test my tool delfies on real data, to get a concrete sense of its inputs/outputs and test that it is properly installed.

    Description

    Genome

    I downloaded the genome of Oscheius onirici, accession: GCA_932521025.

    I subsampled the genome to the last 2kbp of chromosome I, which contains an elimination breakpoint,
    using `seqkit` v2.8.2, giving the FASTA file in this release.

    Sequencing data

    I then downloaded the following sequencing data for *O. onirici*, from the European Nucleotide Archive:

    • ERR5967937: Illumina NovaSeq 6000 paired end short reads. Reads are 2x150bp with average per-base quality of Q27.
    • ERR10796202: Oxford Nanopore PromethION long reads. Reads have average length 11.9kbp and average per-base quality Q11.4.
    • ERR7979900: Pacific Biosciences (PacBio) Sequel II long reads. Reads have average length 11.1kbp and average per-base quality Q28.

    And aligned them to the above genome with `minimap2` version 2.26-r1175, using the following presets:
    "map-ont" for the Nanopore data, "map-hifi" for the PacBio data, "sr" for the Illumina data.

    After sorting with `samtools`, this gives the BAM files in this release.
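
    The alignment and sorting step above can be sketched as a pair of command builders. This is an illustration under stated assumptions rather than the author's exact invocation: the `-x` presets come from the text, while the `-a` flag (SAM output), the `samtools sort -o` form, the platform keys and all file names are assumptions made here. The commands are only printed, not executed.

```python
import shlex

# Presets named in the text, keyed by sequencing platform (keys are made up).
PRESETS = {"nanopore": "map-ont", "pacbio": "map-hifi", "illumina": "sr"}


def minimap2_command(platform, genome, reads):
    """Return the minimap2 argument list for one read set (SAM on stdout)."""
    # -a (SAM output) is an assumption; the text only names the -x presets.
    return ["minimap2", "-a", "-x", PRESETS[platform], genome, *reads]


def sort_command(sam_path, bam_path):
    """Return a samtools argument list that sorts a SAM file into a BAM file."""
    return ["samtools", "sort", "-o", bam_path, sam_path]


# Hypothetical file names, for illustration only.
cmd = minimap2_command("illumina", "genome.fa", ["reads_1.fq.gz", "reads_2.fq.gz"])
print(shlex.join(cmd))
print(shlex.join(sort_command("illumina.sam", "illumina.bam")))
```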

    Running delfies

    I then ran `delfies` version 0.6.0 on each BAM and genome, as:

    ```sh
    delfies --threads 16 \
    --telo_forward_seq TTAGGC \
    --breakpoint_type all \
    --min_mapq 20 \
    --min_supporting_reads 6 \
    ${genome} ${bam} ${odirname}
    ```

    The three resulting output directories are in this release, prefixed with `delfies_`.

    A single, identical breakpoint is found using all three BAMs (see files '*breakpoint_locations.bed').
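
    One way to check that statement is to compare the intervals in the three BED files. The sketch below assumes only the standard first three BED columns (chromosome, start, end) and ignores any further columns; the helper name `read_bed_intervals` and the coordinates in the example strings are invented for illustration and are not taken from the release.

```python
def read_bed_intervals(lines):
    """Collect (chrom, start, end) tuples from tab-separated BED lines."""
    intervals = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.add((chrom, int(start), int(end)))
    return intervals


# Invented stand-ins for the three '*breakpoint_locations.bed' files.
illumina = read_bed_intervals(["chrI\t1500\t1501\textra_column"])
nanopore = read_bed_intervals(["# comment", "chrI\t1500\t1501"])
pacbio = read_bed_intervals(["chrI\t1500\t1501", ""])

same_breakpoint = illumina == nanopore == pacbio and len(illumina) == 1
print(same_breakpoint)
```

    With real data, each list of strings would be replaced by an open file handle for the corresponding BED file.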

    Data source

    The above raw data were produced and released by the Wellcome Sanger Institute as part of projects
    PRJEB51305 and PRJEB59023.

  10. Data from: Bioinformatic teaching resources - for educators, by educators -...

    • osti.gov
    Updated May 17, 2021
    + more versions
    Cite
    Bioinformatic teaching resources - for educators, by educators - using KBase, a free, user-friendly, open source platform [Dataset]. https://www.osti.gov/biblio/1783189
    Dataset updated
    May 17, 2021
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
    Description

    Over the past year, biology educators and staff at the Department of Energy Systems Biology Knowledgebase (KBase) initiated a collaborative effort to develop a curriculum for bioinformatics education. KBase is a free and easily accessible data science platform that integrates many bioinformatics resources into a graphical user interface built upon reproducible analysis notebooks. KBase held conversations with college and high school instructors to understand how KBase could potentially support their educational goals. These conversations morphed into a working group of biological and data science instructors that adapted the KBase platform to their curriculum needs, specifically around concepts in Genomics, Metagenomics, Pangenomics, and Phylogenetics. The KBase Educators Working Group developed modular, adaptable, and customizable instructional units. Each instructional module contains teaching resources, publicly available data, analysis tools, and markdown capability to tailor instructions and learning goals for each class. The online user interface enables students to conduct hands-on data science research and analyses without requiring programming skills or their own computational resources (these are provided by KBase). Alongside these resources, KBase continues to work with instructors, supporting the development of additional curriculum modules. For anyone new to the platform, KBase, and the growing KBase Educators Organization, provides a community network, accompanied by community-sourced guidelines, instructional templates, and peer support to use KBase within a classroom whether virtual or in-person.

  11. Example FAIRtracks JSON document - augmented

    • zenodo.org
    • data.niaid.nih.gov
    json
    Updated Jul 20, 2023
    Cite
    Eivind Hovig; Salvador Capella-Gutierrez; Finn Drabløs; José M. Fernández; Sveinung Gundersen; Radmila Kompova; Kieron Taylor; Dmytro Titov; Daniel Zerbino (2023). Example FAIRtracks JSON document - augmented [Dataset]. http://doi.org/10.5281/zenodo.3984947
    Available download formats: json
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eivind Hovig; Salvador Capella-Gutierrez; Finn Drabløs; José M. Fernández; Sveinung Gundersen; Radmila Kompova; Kieron Taylor; Dmytro Titov; Daniel Zerbino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    Many types of data from genomic analyses can be represented as genomic tracks, i.e. features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, or RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information.

    FAIRtracks software ecosystem

    We have, as an output of the ELIXIR Implementation Study "FAIRification of Genomic Tracks", developed a basic set of recommendations for genomic track metadata together with an implementation called FAIRtracks in the form of a JSON Schema. We propose FAIRtracks as a draft standard for genomic track metadata in order to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable). We have demonstrated practical usage of this approach by designing a software ecosystem around the FAIRtracks draft standard, integrating globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories into a novel track search service, called TrackFind. The software ecosystem also includes the FAIRtracks augmentation service, which assists metadata producers by automatically augmenting minimal machine-readable metadata with their human-readable counterparts, as well as the FAIRtracks validation service, which extends basic JSON Schema validation to include FAIR-related features (global identifiers, ontology terms, and object references). Finally, we have implemented track metadata search and import functionality into relevant analytical tools: EPICO and the GSuite HyperBrowser. For an overview of the FAIRtracks software ecosystem, please visit: http://fairtracks.github.io/

    Example FAIRtracks JSON document - augmented

    The "Example FAIRtracks JSON document - augmented" is generated as part of the build process of the FAIRtracks draft standard JSON Schema (source code: https://github.com/fairtracks/fairtracks_standard/). The example FAIRtracks document contains a small selection of tracks and objects from the ENCODE project metadata (https://www.encodeproject.org/), adapted to align with the FAIRtracks draft standard. In addition to being available in the above-mentioned GitHub repository, the "Example FAIRtracks JSON document - augmented" is also published here on Zenodo in order for the document to be globally uniquely identifiable by a Digital Object Identifier (DOI).

  12. (high-temp) No 4. Taxonomic: (16S rRNA/ITS) Output

    • dataone.org
    • smithsonian.figshare.com
    Updated Aug 15, 2024
    Cite
    Jarrod Scott (2024). (high-temp) No 4. Taxonomic: (16S rRNA/ITS) Output [Dataset]. https://dataone.org/datasets/urn%3Auuid%3A5e3d7158-f083-4c08-8d6a-2054a5de1279
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Smithsonian Research Data Repository
    Authors
    Jarrod Scott
    Description

    Output files from the No 4. Taxonomic Workflow page of the SWELTR high-temp study. In this workflow we used the microeco package for taxonomic assessment. We first converted each phyloseq object into a microtable object using the file2meco package.

    taxa_wf.rdata : contains all variables and phyloseq objects from 16S rRNA and ITS ASV taxonomic assessment. To see the objects, in R run _load("taxa_wf.rdata", verbose=TRUE)_

    Additional files:

    For convenience, we also include individual phyloseq and microtable objects (collected in zip files).

    _**ITS (its_taxa_objects.zip):**_
    its18_ps_work_me.rds : microtable object for the FULL (unfiltered) ITS data.
    its18_ps_filt_me.rds : microtable object for the Arbitrary filtered ITS data.
    its18_ps_perfect_me.rds : microtable object for the PERfect ITS data.
    its18_ps_pime_me.rds : microtable object for the PIME ITS data.

    _**16S rRNA (ssu_taxa_objects.zip):**_
    ssu18_ps_work_me.rds : microtable object for the FULL (unfiltered) 16S rRNA data.
    ssu18_ps_filt_me.rds : microtable object for the Arbitrary filtered 16S rRNA data.
    ssu18_ps_perfect_me.rds : microtable object for the PERfect 16S rRNA data.
    ssu18_ps_pime_me.rds : microtable object for the PIME 16S rRNA data.

    For one of the 16S rRNA analyses we looked at family-level diversity of major bacterial phyla. For this analysis, we renamed NA ranks with the next highest named rank. For example, ASV13884 was unclassified at family level, so the NA was replaced with the next highest named rank (in this case order). Therefore the family-level classification for this ASV was changed to _o_Polyangiales_. Doing this allowed us to include unclassified abundance in our analyses. We include the following phyloseq objects containing the modified taxonomies.

    ssu18_ps_work_clean.rds : modified phyloseq object for the FULL (unfiltered) 16S rRNA data.
    ssu18_ps_filt_clean.rds : modified phyloseq object for the Arbitrary filtered 16S rRNA data.
    ssu18_ps_perfect_clean.rds : modified phyloseq object for the PERfect filtered 16S rRNA data.
    ssu18_ps_pime_clean.rds : modified phyloseq object for the PIME filtered 16S rRNA data.
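
    The NA-renaming rule described above can be illustrated with a short function. The original workflow is written in R; the Python sketch below only mirrors the logic, and the rank list, the one-letter prefixes, the function name `fill_unclassified` and the example taxonomy values are assumptions made for illustration.

```python
# Ranks from highest to lowest, with the one-letter prefix used when a
# higher rank stands in for a missing lower one (assumed convention).
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family"]
PREFIX = {"Kingdom": "k", "Phylum": "p", "Class": "c", "Order": "o", "Family": "f"}


def fill_unclassified(taxonomy, rank="Family"):
    """Return the name at `rank`, or the nearest named higher rank with a prefix."""
    if taxonomy.get(rank) is not None:
        return taxonomy[rank]
    # Walk upward from the rank just above `rank` until a named rank is found.
    for higher in reversed(RANKS[:RANKS.index(rank)]):
        name = taxonomy.get(higher)
        if name is not None:
            return f"{PREFIX[higher]}_{name}"
    return None  # nothing is named at any higher rank


# Illustrative record: the family is missing, the order is known.
asv = {"Kingdom": "Bacteria", "Order": "Polyangiales", "Family": None}
print(fill_unclassified(asv))
```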

    Source code for the workflow can be found here:
    https://github.com/sweltr/high-temp/blob/master/taxa.Rmd

  13. Poxvirus Bioinformatics Resource Center

    • neuinfo.org
    • dknet.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). Poxvirus Bioinformatics Resource Center [Dataset]. http://identifiers.org/RRID:SCR_007870
    Dataset updated
    Jan 29, 2022
    Description

    A database of information on pox viruses. Goals of this project are to acquire and annotate data on poxviruses, and to develop and utilize new tools to facilitate the study of this group of organisms. This basic research is being undertaken with an eye toward the development of novel antiviral therapies, vaccines against human orthopoxvirus infections, new approaches for the environmental detection of virions, and methods to accomplish more rapid diagnosis of disease.

  14. Data from: Classifying protein kinase conformations with machine learning

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2023
    Cite
    Ivan REVEGUK (2023). Classifying protein kinase conformations with machine learning [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_8172570
    Dataset updated
    Jul 23, 2023
    Dataset authored and provided by
    Ivan REVEGUK
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data collection accompanies the manuscript "Classifying protein kinase conformations with machine learning".

    It was created using the kinactive v0.1 tool, written in pure Python (>=3.10). Note that the data are provided for reference and reproducibility purposes and will not be compatible with later versions of kinactive built upon lXtractor > 0.1.1. Refer to the kinactive documentation for instructions on how to obtain an up-to-date version of the structural kinome collection.

    File descriptions:

    db_v3.tar.gz -- a structural kinome collection archive. One can unpack it and inspect the contents or load it into the Python interpreter using the kinactive or lXtractor tools.

    default_*_vs.tsv -- structure/sequence variables calculated with lXtractor and used in an interpretable ML pipeline.

    *_features.tsv -- lists of ranked features selected by the eBoruta tool for each classifier.

    Supplement_labels.tsv -- ML model predictions for each PK domain structure found in db_v3.

  15. EchoBASE

    • rrid.site
    • dknet.org
    • +2more
    Updated Mar 12, 2025
    Cite
    (2025). EchoBASE [Dataset]. http://identifiers.org/RRID:SCR_002430/resolver?q=*&i=rrid
    Dataset updated
    Mar 12, 2025
    Description

    A database that curates new experimental and bioinformatic information about the genes and gene products of the model bacterium Escherichia coli K-12 strain MG1655. It has been created to integrate information from post-genomic experiments into a single resource, with the aim of providing functional predictions for the 1500 or so gene products whose physiological function is unknown. While EchoBASE provides a basic annotation of the genome, taken from other databases, its novelty is in the curation of post-genomic experiments and their linkage to genes of unknown function. Experiments published on E. coli are curated to one of two levels. Papers dealing with the determination of function of a single gene are briefly described, while larger datasets are included in the database and can be searched and manipulated. This includes data for proteomics studies, protein-protein interaction studies, microarray data, functional genomic approaches (looking at multiple deletion strains for novel phenotypes) and a wide range of predictions that come out of in silico bioinformatic approaches. The aim of the database is to provide hypotheses for the functions of uncharacterized gene products that may be used by the E. coli research community to further our knowledge of this model bacterium.

  16. WORKSHOP: Introduction to Machine Learning in R - from data to knowledge

    • explore.openaire.eu
    Updated Dec 9, 2024
    Cite
    Fotis Psomopoulos; Eden Zhang; Erin Graham; Giorgia Mori; Uwe Winter (2024). WORKSHOP: Introduction to Machine Learning in R - from data to knowledge [Dataset]. http://doi.org/10.5281/zenodo.14545611
    Explore at:
    Dataset updated
    Dec 9, 2024
    Authors
    Fotis Psomopoulos; Eden Zhang; Erin Graham; Giorgia Mori; Uwe Winter
    Description

    This record includes training materials associated with the Australian BioCommons workshop ‘Introduction to Machine Learning in R - from data to knowledge’. This workshop took place as a single 4-hour session on 09 December 2024.

    Event description

    With the rise in high-throughput sequencing technologies, the volume of omics data has grown exponentially. A major issue is to mine useful knowledge from these heterogeneous collections of data. The analysis of complex high-volume data is not trivial and classical tools cannot be used to explore their full potential. Machine Learning (ML), a discipline in which computers perform automated learning without being programmed explicitly and assist humans to make sense of large and complex data sets, can thus be very useful in mining large omics datasets to uncover new insights that can advance the field of bioinformatics.

    This hands-on workshop will introduce participants to the ML taxonomy and the applications of common ML algorithms to health data. The workshop will cover the foundational concepts and common methods being used to analyse omics data sets by providing a practical context through the use of basic but widely used R libraries. Participants will acquire an understanding of the standard ML processes, as well as the practical skills in applying them on familiar problems and publicly available real-world data sets.

    Materials are shared under a Creative Commons Attribution 4.0 International agreement unless otherwise specified and were current at the time of the event.

    Lead trainers: Dr Fotis Psomopoulos, Senior Researcher, Institute of Applied Biosciences (INAB), Center for Research and Technology Hellas (CERTH)

    Facilitators: Dr Giorgia Mori, Australian BioCommons; Dr Eden Zhang, Sydney Informatics Hub; Dr Erin Graham, Queensland Cyber Infrastructure Foundation (QCIF)

    Infrastructure provision: Uwe Winter, Australian BioCommons

    Host: Dr Giorgia Mori, Australian BioCommons

    Training materials

    Files and materials included in this record:

    • Event metadata (PDF): Information about the event including description, event URL, learning objectives, prerequisites, technical requirements etc.
    • Training materials webpage
    • Data and documentation

  17. 02_lists

    • doi.ipk-gatersleben.de
    Updated Aug 25, 2022
    Cite
    Stefanie Lueck (2022). 02_lists [Dataset]. https://doi.ipk-gatersleben.de/DOI/8bd9ba3d-7acb-4648-898c-8f360b04304e/7364d6fd-d4b1-42ea-8d93-6d0c33b334c2/0
    Explore at:
    Dataset updated
    Aug 25, 2022
    Dataset provided by
    e!DAL - Plant Genomics and Phenomics Research Data Repository (PGP), IPK Gatersleben, Seeland OT Gatersleben, Corrensstraße 3, 06466, Germany
    Authors
    Stefanie Lueck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the post-genomic era, every biologist is faced with the task of analyzing, interpreting and visualizing complex and huge data. An increasing number of scientists have begun writing small programs using script-based languages, such as Python. This course is designed to train students and scientists without previous experience in programming who want -- or need -- to write their own bioinformatics software tools. The aim of this training course is to provide an introduction to the Python programming language by solving everyday tasks of Bioinformatics. Each folder contains a short introduction video and a PDF file about the topic, assignments covering the topic (provided as Jupyter notebooks) as well as the solutions (provided as Python files).

    Please first read the content of the file 'readme.pdf' and follow the course plan according to the content of the file 'course_plan.pdf'.

    Please cite the authors for all course material if you use them in your work.

  18. ESM-2 embeddings for TCR-Epitope Binding Affinity Prediction Task

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 17, 2024
    Cite
    Tony Reina (2024). ESM-2 embeddings for TCR-Epitope Binding Affinity Prediction Task [Dataset]. http://doi.org/10.5281/zenodo.11894560
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tony Reina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.

    A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker in the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.

    [HuggingFace](https://huggingface.co/facebook/esm2_t36_3B_UR50D) provides a "one-stop shop" to train and deploy AI models. In this case, we use Facebook's open-source [Evolutionary Scale Model (ESM-2)](https://github.com/facebookresearch/esm). These embeddings turn the protein sequences into a vector of numbers that the computer can use in a mathematical model.
    To load them into Python use the Pandas library:
    import pandas as pd
    
    train_data = pd.read_pickle("train_data.pkl")
    validation_data = pd.read_pickle("validation_data.pkl")
    test_data = pd.read_pickle("test_data.pkl")

    The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.

    The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-2 embeddings should be sufficient as input to our binary classification model.

    The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.

    The label column is whether the two proteins bind. 0 = No. 1 = Yes.

    The tcr_vector and epitope_vector columns are the embeddings of the TCR and epitope sequences generated by the Facebook ESM-2 model. These two vectors can be used to create a machine learning model that predicts whether the pair will bind.
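
    As a rough illustration of how the two embedding columns might feed a classifier, the sketch below concatenates each TCR vector with its paired epitope vector into one feature row. It assumes every cell holds a fixed-length 1-D numeric vector, which the description does not state; `build_features` and the demo values are hypothetical, not part of the dataset or the GitHub project.

```python
import numpy as np


def build_features(tcr_vectors, epitope_vectors):
    """Concatenate each TCR embedding with its paired epitope embedding.

    Assumes both inputs are sequences of equal-length 1-D vectors, one per row.
    """
    tcr = np.asarray(tcr_vectors, dtype=np.float32)
    epi = np.asarray(epitope_vectors, dtype=np.float32)
    if tcr.shape[0] != epi.shape[0]:
        raise ValueError("need exactly one epitope vector per TCR vector")
    # Result has shape (n_pairs, tcr_dim + epitope_dim)
    return np.hstack([tcr, epi])


# Tiny stand-in vectors; real embeddings are much longer.
tcr_demo = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
epitope_demo = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X_demo = build_features(tcr_demo, epitope_demo)
print(X_demo.shape)  # (2, 6)
```

    With the DataFrames loaded above, the inputs would presumably be `train_data["tcr_vector"].tolist()` and `train_data["epitope_vector"].tolist()`, with `train_data["label"]` as the target for any binary classifier.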

    From the TDC website:

    T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.

    Weber et al.

    Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJ database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem as protein-ligand binding prediction), the SMILES strings of the epitopes are also included. The 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.

    Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.

    Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.

    References:

    1. Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.
    2. Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.
    3. Dines, Jennifer N., et al. “The immunerace study: A prospective multicohort study of immune response action to covid-19 events with the immunecode™ open access database.” medRxiv (2020).

    Dataset License: CC BY 4.0.

    Contributed by: Anna Weber and Jannis Born.
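
    The filtering, downsampling and negative-sampling steps in the dataset description above can be sketched roughly as follows. This is not the code used by Weber et al. or TDC: the function names, the random re-pairing strategy and the toy data are assumptions made for illustration; only the thresholds (15 and 400) and the idea of pairing TCRs with epitopes they are not known to bind come from the description.

```python
import random
from collections import defaultdict


def filter_and_downsample(pairs, min_tcrs=15, max_tcrs=400, seed=0):
    """Drop epitopes with fewer than min_tcrs TCRs; cap the rest at max_tcrs."""
    rng = random.Random(seed)
    by_epitope = defaultdict(list)
    for tcr, epitope in pairs:
        by_epitope[epitope].append(tcr)
    kept = []
    for epitope, tcrs in by_epitope.items():
        if len(tcrs) < min_tcrs:
            continue  # too few examples for this epitope
        if len(tcrs) > max_tcrs:
            tcrs = rng.sample(tcrs, max_tcrs)  # random downsample
        kept.extend((tcr, epitope) for tcr in tcrs)
    return kept


def add_shuffled_negatives(positives, seed=0):
    """Add one negative per positive by re-pairing its TCR with an epitope
    it is not known to bind (an assumed reading of 'shuffling the pairs')."""
    rng = random.Random(seed)
    known = set(positives)
    epitopes = sorted({epitope for _, epitope in positives})
    labelled = [(tcr, epitope, 1) for tcr, epitope in positives]
    for tcr, _ in positives:
        candidates = [e for e in epitopes if (tcr, e) not in known]
        if candidates:
            labelled.append((tcr, rng.choice(candidates), 0))
    return labelled


# Toy data: epitope B has too few TCRs, epitope C has too many.
pairs_demo = (
    [(f"A_tcr{i}", "A") for i in range(20)]
    + [(f"B_tcr{i}", "B") for i in range(3)]
    + [(f"C_tcr{i}", "C") for i in range(500)]
)
kept_demo = filter_and_downsample(pairs_demo)
labelled_demo = add_shuffled_negatives(kept_demo)
print(len(kept_demo), len(labelled_demo))  # 420 840
```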

    The Facebook ESM-2 model has the MIT license and was published in:
    * Zeming Lin et al, Evolutionary-scale prediction of atomic-level protein structure with a language model, Science (2023). DOI: 10.1126/science.ade2574 https://www.science.org/doi/10.1126/science.ade2574
    HuggingFace has several versions of the trained model.
    Checkpoint name        Number of layers    Number of parameters
    esm2_t48_15B_UR50D     48                  15B
    esm2_t36_3B_UR50D      36                  3B
    esm2_t33_650M_UR50D    33                  650M
    esm2_t30_150M_UR50D    30                  150M
    esm2_t12_35M_UR50D     12                  35M
    esm2_t6_8M_UR50D       6                   8M
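
    The checkpoint names listed above appear to encode the layer count and nominal parameter count directly. A small sketch of reading them programmatically follows; the regular expression is inferred only from the six names listed here and `parse_checkpoint` is a hypothetical helper, not part of HuggingFace or the ESM code.

```python
import re

# Pattern inferred from the names in the table: esm2_t<layers>_<size><M|B>_UR50D
CHECKPOINT_PATTERN = re.compile(r"^esm2_t(\d+)_(\d+)([MB])_UR50D$")
SCALE = {"M": 10**6, "B": 10**9}


def parse_checkpoint(name):
    """Return (number of layers, nominal parameter count) for a checkpoint name."""
    match = CHECKPOINT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {name!r}")
    layers, size, unit = match.groups()
    return int(layers), int(size) * SCALE[unit]


names = [
    "esm2_t48_15B_UR50D",
    "esm2_t36_3B_UR50D",
    "esm2_t33_650M_UR50D",
    "esm2_t30_150M_UR50D",
    "esm2_t12_35M_UR50D",
    "esm2_t6_8M_UR50D",
]
# Pick the smallest model, e.g. for a quick test on limited hardware.
smallest = min(names, key=lambda n: parse_checkpoint(n)[1])
print(smallest)  # esm2_t6_8M_UR50D
```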

  19. Table_1_Exploring the pathogenesis linking traumatic brain injury and...

    • figshare.com
    docx
    Updated Jun 21, 2023
    + more versions
    Cite
    Gengshui Zhao; Yongqi Fu; Chao Yang; Xuehui Yang; Xiaoxiao Hu (2023). Table_1_Exploring the pathogenesis linking traumatic brain injury and epilepsy via bioinformatic analyses.DOCX [Dataset]. http://doi.org/10.3389/fnagi.2022.1047908.s003
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Frontiers
    Authors
    Gengshui Zhao; Yongqi Fu; Chao Yang; Xuehui Yang; Xiaoxiao Hu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traumatic brain injury (TBI) is a serious disease that could increase the risk of epilepsy. The purpose of this article is to explore the common molecular mechanism in TBI and epilepsy with the aim of providing a theoretical basis for the prevention and treatment of post-traumatic epilepsy (PTE). Two datasets of TBI and epilepsy in the Gene Expression Omnibus (GEO) database were downloaded. Functional enrichment analysis, protein–protein interaction (PPI) network construction, and hub gene identification were performed based on the cross-talk genes of aforementioned two diseases. Another dataset was used to validate these hub genes. Moreover, the abundance of infiltrating immune cells was evaluated through Immune Cell Abundance Identifier (ImmuCellAI). The common microRNAs (miRNAs) between TBI and epilepsy were acquired via the Human microRNA Disease Database (HMDD). The overlapped genes in cross-talk genes and target genes predicted through the TargetScan were obtained to construct the common miRNAs–mRNAs network. A total of 106 cross-talk genes were screened out, including 37 upregulated and 69 downregulated genes. Through the enrichment analyses, we showed that the terms about cytokine and immunity were enriched many times, particularly interferon gamma signaling pathway. Four critical hub genes were screened out for co-expression analysis. The miRNA–mRNA network revealed that three miRNAs may affect the shared interferon-induced genes, which might have essential roles in PTE. Our study showed the potential role of interferon gamma signaling pathway in pathogenesis of PTE, which may provide a promising target for future therapeutic interventions.

  20. Adaptation of a Bioinformatics Microarray Analysis Workflow for a...

    • plos.figshare.com
    xlsx
    Updated Jun 5, 2023
    + more versions
    Cite
    Sophie Depiereux; Bertrand De Meulder; Eric Bareke; Fabrice Berger; Florence Le Gac; Eric Depiereux; Patrick Kestemont (2023). Adaptation of a Bioinformatics Microarray Analysis Workflow for a Toxicogenomic Study in Rainbow Trout [Dataset]. http://doi.org/10.1371/journal.pone.0128598
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Sophie Depiereux; Bertrand De Meulder; Eric Bareke; Fabrice Berger; Florence Le Gac; Eric Depiereux; Patrick Kestemont
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sex steroids play a key role in triggering sex differentiation in fish, the use of exogenous hormone treatment leading to partial or complete sex reversal. This phenomenon has attracted attention since the discovery that even low environmental doses of exogenous steroids can adversely affect gonad morphology (ovotestis development) and induce reproductive failure. Modern genomic-based technologies have enhanced opportunities to uncover mechanisms of action (MOA) and identify biomarkers related to the toxic action of a compound. However, high throughput data interpretation relies on statistical analysis, species genomic resources, and bioinformatics tools. The goals of this study are to improve the knowledge of feminisation in fish, by the analysis of molecular responses in the gonads of rainbow trout fry after chronic exposure to several doses (0.01, 0.1, 1 and 10 μg/L) of ethynylestradiol (EE2) and to offer target genes as potential biomarkers of ovotestis development. We successfully adapted a bioinformatics microarray analysis workflow elaborated on human data to a toxicogenomic study using rainbow trout, a fish species lacking accurate functional annotation and genomic resources. The workflow allowed us to obtain lists of genes supposed to be enriched in true positive differentially expressed genes (DEGs), which were subjected to over-representation analysis methods (ORA). Several pathways and ontologies, mostly related to cell division and metabolism, sexual reproduction and steroid production, were found significantly enriched in our analyses. Moreover, two sets of potential ovotestis biomarkers were selected using several criteria. The first group displayed specific potential biomarkers belonging to pathways/ontologies highlighted in the experiment. Among them, the early ovarian differentiation gene foxl2a was overexpressed. The second group, which was highly sensitive but not specific, included the DEGs presenting the highest fold change and lowest p-value of the statistical workflow output. The methodology can be generalized to other (non-model) species and various types of microarray platforms.
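
    For readers unfamiliar with over-representation analysis, one common formulation is a one-sided hypergeometric test on the overlap between a DEG list and a pathway's gene set. The sketch below shows that formulation only; the abstract does not say which ORA methods the study used, and the function name and all numbers in the example are hypothetical.

```python
from math import comb


def ora_p_value(universe_size, set_size, selected_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    universe_size: number of genes measured on the platform
    set_size:      genes annotated to the pathway or ontology term
    selected_size: genes in the DEG list
    overlap:       DEGs that are also in the pathway
    """
    upper = min(set_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, a 100-gene pathway, 500 DEGs, 10 shared.
p_demo = ora_p_value(20000, 100, 500, 10)
print(f"p = {p_demo:.3g}")
```

    In practice the p-values for many pathways would also need a multiple-testing correction before calling any of them enriched.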
