100+ datasets found
  1. Bioinformatics repository examples with good practices of using GitHub.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Cite
    Yasset Perez-Riverol; Laurent Gatto; Rui Wang; Timo Sachsenberg; Julian Uszkoreit; Felipe da Veiga Leprevost; Christian Fufezan; Tobias Ternent; Stephen J. Eglen; Daniel S. Katz; Tom J. Pollard; Alexander Konovalov; Robert M. Flight; Kai Blin; Juan Antonio Vizcaíno (2023). Bioinformatics repository examples with good practices of using GitHub. [Dataset]. http://doi.org/10.1371/journal.pcbi.1004947.t001
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yasset Perez-Riverol; Laurent Gatto; Rui Wang; Timo Sachsenberg; Julian Uszkoreit; Felipe da Veiga Leprevost; Christian Fufezan; Tobias Ternent; Stephen J. Eglen; Daniel S. Katz; Tom J. Pollard; Alexander Konovalov; Robert M. Flight; Kai Blin; Juan Antonio Vizcaíno
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/.
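
    The prefix rule described above can be sketched as follows. This is only an illustration: the helper name and the example path are hypothetical and are not taken from the table itself.

```python
GITHUB_PREFIX = "https://github.com/"


def full_url(relative_path: str) -> str:
    """Join a repository-relative path from the table with the GitHub prefix."""
    # Strip a leading slash so the result never contains "//" after the host.
    return GITHUB_PREFIX + relative_path.lstrip("/")


# Hypothetical table value, used only for illustration.
example = full_url("some-org/some-repo/issues")
print(example)
```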

  2. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
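
    The three steps above can be sketched roughly as follows. This is not the author's script: the description says Biopython was used for the property calculations, whereas this sketch substitutes a hand-written Kyte-Doolittle table so that it runs without extra dependencies. The integer IDs, the seed, and the function name are assumptions, and only two of the listed properties are computed.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy index for each residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def make_protein(rng: random.Random, protein_id: int) -> dict:
    """Build one synthetic record (ID format is an assumption)."""
    # Step 1: random sequence of 50 to 300 residues.
    length = rng.randint(50, 300)
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    # Step 2: average hydrophobicity over the whole sequence.
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    # Step 3: class drawn at random, independent of the sequence.
    return {
        "ID_Protein": protein_id,
        "Sequence": sequence,
        "Sequence_Length": length,
        "Hydrophobicity": hydrophobicity,
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
proteins = [make_protein(rng, i) for i in range(5)]
for p in proteins:
    print(p["ID_Protein"], p["Sequence_Length"], round(p["Hydrophobicity"], 3), p["Class"])
```

    Because the class is drawn independently of the sequence, a classifier trained on data generated this way should not be expected to beat chance, which is consistent with the limitations listed for this dataset.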

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
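
    The sizes above correspond to an 80/20 split of the 20,000 records. How the published files were actually split is not stated; the sketch below assumes a seeded random shuffle over record indices, and the function name and seed are hypothetical.

```python
import random


def split_ids(ids, test_fraction: float, seed: int):
    """Shuffle the IDs with a fixed seed and cut off a test subset."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # First n_test shuffled IDs become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_ids(range(20_000), test_fraction=0.2, seed=0)
print(len(train_ids), len(test_ids))
```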

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  3. Data from: Advancing computational biology and bioinformatics research...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Sep 27, 2019
    Cite
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
    Dataset updated
    Sep 27, 2019
    Authors
    Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
    Description

    Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.

  4. MINUTE-ChIP example data

    • figshare.scilifelab.se
    txt
    Updated Jan 15, 2025
    + more versions
    Cite
    Carmen Navarro Luzon; Simon Elsässer (2025). MINUTE-ChIP example data [Dataset]. http://doi.org/10.17044/scilifelab.25348405.v1
    Available download formats: txt
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Carmen Navarro Luzon; Simon Elsässer
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This collection contains an example MINUTE-ChIP dataset on which to run the minute pipeline. It is provided as supporting material to help users follow a MINUTE-ChIP experiment from raw data to a primary analysis that yields the files needed for downstream analysis, along with summarized QC indicators. The example primary non-demultiplexed FASTQ files provided here were used to generate GSM5493452-GSM5493463 (H3K27me3) and GSM5823907-GSM5823918 (Input) with the minute pipeline; these are deposited together on GEO under series GSE181241. For more information about MINUTE-ChIP, see the publication relevant to this dataset: Kumar, Banushree, et al. "Polycomb repressive complex 2 shields naïve human pluripotent cells from trophectoderm differentiation." Nature Cell Biology 24.6 (2022): 845-857. More information about the minute pipeline is available in a public bioRxiv preprint, a GitHub repository, and the official documentation.

  5. Data from: DNA metabarcoding captures subtle differences in forest beetle...

    • researchdata.edu.au
    Updated 2020
    Cite
    Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu (2020). DNA metabarcoding captures subtle differences in forest beetle communities following disturbance [Dataset]. https://researchdata.edu.au/dna-metabarcoding-captures-following-disturbance/1676001
    Dataset updated
    2020
    Dataset provided by
    University of Tasmania, Australia
    Authors
    Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset includes all raw MiSeq high-throughput sequencing data, the bioinformatic pipeline, and the R code used in the publication "Liu M, Baker SC, Burridge CP, Jordan GJ, Clarke LJ (2020) DNA metabarcoding captures subtle differences in forest beetle communities following disturbance. Restoration Ecology. 28:1475-1484. DOI:10.1111/rec.13236."

    • Miseq_16S.zip - MiSeq sequencing dataset for gene marker 16S, including 48 fastq files for 24 beetle bulk samples.
    • Miseq_CO1.zip - MiSeq sequencing dataset for gene marker CO1, including 46 fastq files for 23 beetle bulk samples (one sample failed to be sequenced).
    • nfp4MBC.nf - A Nextflow bioinformatic script to process MiSeq datasets.
    • nextflow.config - A configuration file needed when using nfp4MBC.nf.
    • adapters_16S.zip - Adapters used to tag each of 24 beetle bulk samples for 16S, also used to process the 16S MiSeq dataset when using nfp4MBC.nf.
    • adapters_CO1.zip - Adapters used to tag each of 24 beetle bulk samples for CO1, also used to process the CO1 MiSeq dataset when using nfp4MBC.nf.
    • rMBC.Rmd - R Markdown code for community analyses.
    • rMBC.zip - Datasets used in rMBC.Rmd.
    • COI_ZOTUs_176.fasta - DNA sequences of 176 COI ZOTUs.
    • 16S_ZOTUs_156 - DNA sequences of 156 16S ZOTUs.

  6. Making toast: Using analogies to explore concepts in bioinformatics

    • qubeshub.org
    Updated Aug 26, 2021
    Cite
    Kate Hertweck (2021). Making toast: Using analogies to explore concepts in bioinformatics [Dataset]. http://doi.org/10.24918/cs.2016.11
    Dataset updated
    Aug 26, 2021
    Dataset provided by
    QUBES
    Authors
    Kate Hertweck
    Description

    Contemporary biology is moving towards heavy reliance on computational methods to manage, find patterns, and derive meaning from large-scale data, such as genomic sequences. Biology teachers are increasingly compelled to prepare students with skills to meet these challenges. However, introducing biology students to more abstract concepts associated with computational thinking remains a major challenge. Analogies have long been used in science classrooms to help students comprehend complex concepts by relating them to familiar processes. Here I present a multi-step procedure for introducing students to large-scale data analysis (bioinformatics workflows) by asking them to describe a common daily task: making toast. First, students describe the main steps associated with this procedure. Next, students are presented with alternative scenarios for materials and equipment and are asked to extend the analogy to accommodate them. Finally, students are led through examples of how the analogy breaks down, or fails to accurately represent, a bioinformatics analysis. This structured approach to student exploration of analogies related to computational biology capitalizes on diverse student experiences to both clarify concepts and ameliorate possible misconceptions. Similar methods can be used to introduce many abstract concepts in both biology and computer science.

  7. INSDC Environment Sample Sequences

    • gbif.org
    • demo.gbif.org
    • +1 more
    Updated Nov 29, 2025
    Cite
    European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
    Dataset updated
    Nov 29, 2025
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Bioinformatics Institute (http://www.ebi.ac.uk/)
    Authors
    European Bioinformatics Institute (EMBL-EBI)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`

    EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

    The data was then processed as follows:

    1. Human sequences were excluded.

    2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.

    3. Contigs and whole genome shotgun (WGS) records were added individually.

    4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

    5. The records associated with the same vouchers were aggregated together.

    6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

    7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

    More information available here: https://github.com/gbif/embl-adapter#readme

    You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
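
    The filtering in steps 4 and 6 above can be sketched as follows. This is a minimal illustration on in-memory dictionaries, not the embl-adapter implementation: the grouping keys and the threshold of 50 come from the description, while the `specimen_voucher` field name, the function names, and the toy records are assumptions.

```python
from collections import defaultdict

# Grouping keys and threshold as listed in step 6 of the description.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)
MAX_GROUP_SIZE = 50


def has_required_info(record: dict) -> bool:
    """Step 4: keep a record with a voucher, or with both a location and a date."""
    has_voucher = bool(record.get("specimen_voucher"))  # hypothetical field name
    has_place_and_date = bool(record.get("location")) and bool(record.get("collection_date"))
    return has_voucher or has_place_and_date


def drop_oversized_groups(records: list, max_size: int = MAX_GROUP_SIZE) -> list:
    """Step 6: group records by the listed keys and drop groups above max_size."""
    groups = defaultdict(list)
    for record in records:
        # Missing keys become None, so absent sample accessions still group together.
        key = tuple(record.get(k) for k in GROUP_KEYS)
        groups[key].append(record)
    return [r for group in groups.values() if len(group) <= max_size for r in group]


# Toy data: 51 identical records (likely reads of one organism) plus one distinct record.
base = {"scientific_name": "Species A", "collection_date": "2020-01-01",
        "location": "10 N 20 E", "country": "X"}
records = [dict(base) for _ in range(51)]
records.append({**base, "scientific_name": "Species B"})

kept = drop_oversized_groups([r for r in records if has_required_info(r)])
print(len(kept))
```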

  8. Examples datasets for Microbiology

    • zenodo.org
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Tristan Cordier (2020). Examples datasets for Microbiology [Dataset]. http://doi.org/10.5281/zenodo.2605445
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tristan Cordier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three example datasets for performing bioinformatics analyses.

  9. temporary examples

    • figshare.com
    xlsx
    Updated Dec 15, 2018
    Cite
    Ryan Dale (2018). temporary examples [Dataset]. http://doi.org/10.6084/m9.figshare.7470083.v1
    Available download formats: xlsx
    Dataset updated
    Dec 15, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ryan Dale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example files to test URL handling

  10. RMQS1 16S bioinformatic config files and control sample data

    • entrepot.recherche.data.gouv.fr
    application/gzip, tsv +1
    Updated Aug 22, 2024
    + more versions
    Cite
    Sébastien Terrat; Samuel Dequiedt (2024). RMQS1 16S bioinformatic config files and control sample data [Dataset]. http://doi.org/10.57745/XBFOJP
    Available download formats: tsv(522347), txt(143493), tsv(8814), tsv(33093), tsv(117004), application/gzip(362535), tsv(13212), tsv(32344), tsv(266094), tsv(80032), txt(10413), tsv(16460)
    Dataset updated
    Aug 22, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Sébastien Terrat; Samuel Dequiedt
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    French National Research Agency (ANR)
    France Génomique
    French Agency for Ecological Transition (ADEME)
    Description

    RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.

    Dataset: This dataset contains the config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance files of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of the 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.

    File structure:

    • Taxonomy files rmqs1_control_taxonomy_: Taxonomy is split across five files with one line per site and one column per taxon. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present:
        • Unknown: not matching any reference.
        • Unclassified: missing taxa between genus and phylum.
        • Environmental: matched to a sample from an environmental study, generally with only a phylum name.
    • rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTU, “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold).
    • rmqs1_16S_bank_association.tsv: two-column file with the bank name for each sample.
    • rmqs1_16S_bank_metadata.tsv:
        • library_name: library name used in labs
        • study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier
        • library_name_genoscope: library name used in the Genoscope sequence center
        • MID: multiplex identifier sequence
        • run_alias: Genoscope internal alias
        • ftp_link: FTP link to download library
    • Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
    • project_G4.tab: Comma separated file containing the information needed to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only:
        • PROJECT: Project name chosen by the user
        • LIBRARY_NAME: Library name chosen by the user
        • LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
        • SAMPLE_NAME: Sample name chosen by the user
        • MID_F: MID name or MID sequence associated to the Forward primer
        • MID_R: MID name or MID sequence associated to the Reverse primer
        • TARGET: Target gene (16S, 18S, or 23S)
        • PRIMER_F: Forward primer name used for amplification
        • PRIMER_R: Reverse primer name used for amplification
        • SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
        • SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification
    • Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
    • project_GLOBAL.tab: Comma separated file containing the information needed to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline:
        • PROJECT: Project name chosen by the user
        • LIBRARY_NAME: Library name chosen by the user
        • LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
        • SAMPLE_NAME: Sample name chosen by the user
        • MID_F: MID name or MID sequence associated to the Forward primer
        • MID_R: MID name or MID sequence associated to the Reverse primer
        • TARGET: Target gene (16S, 18S, or 23S)
        • PRIMER_F: Forward primer name used for amplification
        • PRIMER_R: Reverse primer name used for amplification
        • SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
        • SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification

    Details: Data from three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples can be present in several libraries. We kept only the one with the highest number of sequences.
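
    The statement that each line of the abundance and taxonomy tables sums to 10k can be checked with a short script. The sketch below is an assumption-laden illustration: it supposes that the first column holds the site identifier and that every other column holds an integer count, which the description does not spell out. The toy table and the function name are hypothetical.

```python
import csv
import io

RAREFACTION_DEPTH = 10_000  # "10k" in the description


def rows_not_at_depth(tsv_text: str, depth: int = RAREFACTION_DEPTH) -> list:
    """Return the site IDs whose counts do not sum to the rarefaction depth."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row
    bad_sites = []
    for row in reader:
        # Assumed layout: first column is the site ID, the rest are integer counts.
        site, counts = row[0], row[1:]
        if sum(int(c) for c in counts) != depth:
            bad_sites.append(site)
    return bad_sites


# Toy table in the assumed layout; real files would be read from disk instead.
toy_table = (
    "site\tDB1\tDB2\tOUT1\n"
    "site_A\t6000\t3000\t1000\n"
    "site_B\t5000\t4000\t500\n"
)
print(rows_not_at_depth(toy_table))
```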

  11. Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 11, 2014
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng (2014). “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001209841
    Dataset updated
    Dec 11, 2014
    Authors
    Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng
    Description

    “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education

  12. Sample DNA Sequence

    • kaggle.com
    zip
    Updated Jan 14, 2021
    Cite
    Sreshta Putchala (2021). Sample DNA Sequence [Dataset]. https://www.kaggle.com/sreshta140/covid19-genome-sequence
    Available download formats: zip (69652 bytes)
    Dataset updated
    Jan 14, 2021
    Authors
    Sreshta Putchala
    Description

    Dataset

    This dataset was created by Sreshta Putchala


  13. Bakta Annotation Examples

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Nov 10, 2021
    Cite
    Oliver Schwengers (2021). Bakta Annotation Examples [Dataset]. http://doi.org/10.5281/zenodo.4922840
    Available download formats: application/gzip
    Dataset updated
    Nov 10, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oliver Schwengers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data repository provides exemplary bacterial genome annotations conducted with Bakta of a broad taxonomical range of genomes comprising many pathogens (all ESKAPE), commensals and environmental species.

    Bakta is a tool for the rapid & standardized local annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis: https://github.com/oschwengers/bakta

  14. Example File 1.txt

    • figshare.com
    txt
    Updated Apr 27, 2020
    + more versions
    Cite
    Vafaee Lab (2020). Example File 1.txt [Dataset]. http://doi.org/10.6084/m9.figshare.12200138.v1
    Available download formats: txt
    Dataset updated
    Apr 27, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vafaee Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example files to run DrugSimDB interface

  15. metabarcoding data for: Benchmark of bioinformatics tools for fast and...

    • search.dataone.org
    • dataone.org
    • +1 more
    Updated May 17, 2025
    Cite
    Laetitia Mathon (2025). metabarcoding data for: Benchmark of bioinformatics tools for fast and accurate species identification from environmental DNA metabarcoding [Dataset]. http://doi.org/10.5061/dryad.15dv41nx6
    Dataset updated
    May 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Laetitia Mathon
    Time period covered
    Jan 1, 2021
    Description

    This dataset contains fish DNA sequence samples, simulated with Grinder to build a mock community, as well as real fish eDNA metabarcoding data from the Mediterranean Sea.

    These data have been used to compare the efficiency of different bioinformatic tools in retrieving the species composition of real and simulated samples.

  16. 18s_SSU identified sample library

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    + more versions
    Cite
    Cathy Cavallo (2022). 18s_SSU identified sample library [Dataset]. http://doi.org/10.26180/5ea7d9b786c4e
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Cathy Cavallo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the identified sample library containing little penguin faecal samples with numbers of sequence reads for each taxon identified.

  17. Data from: Consensus clustering of gene expression microarray data using...

    • bridges.monash.edu
    • researchdata.edu.au
    pdf
    Updated Nov 21, 2017
    Cite
    Mendes, Alexandre (2017). Consensus clustering of gene expression microarray data using genetic algorithms [Dataset]. http://doi.org/10.4225/03/5a13728358b1d
    Available download formats: pdf
    Dataset updated
    Nov 21, 2017
    Dataset provided by
    Monash University
    Authors
    Mendes, Alexandre
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    This work presents a new consensus clustering method for gene expression microarray data based on a genetic algorithm. Using two datasets - DA and DB - as input, the genetic algorithm examines putative partitions for the samples in DA, selecting biomarkers that support such partitions. The biomarkers are then used to build a classifier, which is applied to DB to determine the classes of its samples. The genetic algorithm is guided by an objective function that takes into account the accuracy of classification in both datasets, the number of biomarkers that support the partition, and the distribution of the samples across the classes for each dataset. To illustrate the method, two whole-genome breast cancer instances from different sources were used. In this application, the results indicate that the method could be used to find unknown subtypes of diseases supported by biomarkers presenting similar gene expression profiles across platforms. Moreover, even though this initial study was restricted to two datasets and two classes, the method can be easily extended to consider both more datasets and classes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  18. Data Sheet 1_Comprehensive bioinformatics analysis identifies metabolic and...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jan 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Liang, Qianqian; Wang, Yide; Li, Zheng (2025). Data Sheet 1_Comprehensive bioinformatics analysis identifies metabolic and immune-related diagnostic biomarkers shared between diabetes and COPD using multi-omics and machine learning.zip [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001289675
    Dataset updated
    Jan 8, 2025
    Authors
    Liang, Qianqian; Wang, Yide; Li, Zheng
    Description

    Background: Diabetes and chronic obstructive pulmonary disease (COPD) are prominent global health challenges, each imposing significant burdens on affected individuals, healthcare systems, and society. However, the specific molecular mechanisms supporting their interrelationship have not been fully defined.

    Methods: We identified the differentially expressed genes (DEGs) of COPD and diabetes from multi-center patient cohorts, respectively. Through cross-analysis, we identified the shared DEGs of COPD and diabetes, and investigated alterations of signaling pathways using Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and gene set enrichment analysis (GSEA). By using weighted gene correlation network analysis (WGCNA), key gene modules for COPD and diabetes were identified, and various machine learning algorithms were employed to identify shared biomarkers. Using xCell, we investigated the relationship between shared biomarkers and immune infiltration in diabetes and COPD. Single-cell sequencing, clinical samples, and animal models were used to confirm the robustness of shared biomarkers.

    Results: Cross-analysis identified 186 shared DEGs between diabetes and COPD patients. Functional enrichment results demonstrate that metabolic and immune-related pathways are common features altered in both diabetes and COPD patients. WGCNA identified 526 genes from key gene modules in COPD and diabetes. Multiple machine learning algorithms identified 4 shared biomarkers for COPD and diabetes, including CADPS, EDNRB, THBS4 and TMEM27. Finally, the 4 shared biomarkers were validated in single-cell sequencing data, clinical samples, and animal models, and their expression changes were consistent with the results of bioinformatic analysis.

    Conclusions: Through comprehensive bioinformatics analysis, we revealed the potential connection between diabetes and COPD, providing a theoretical basis for exploring the common regulatory genes.

  19. Data from: Spectrum analysis based method for dynamics and collective...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Yi-Zhen Shen; Yong-Sheng Ding; Quan Gu (2022). Spectrum analysis based method for dynamics and collective analysis of protein-protein interaction networks [Dataset]. http://doi.org/10.4225/03/5a13725619374
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Yi-Zhen Shen; Yong-Sheng Ding; Quan Gu
    Description

    The importance of understanding biological interaction networks has fueled the development of numerous techniques for generating interaction data, as well as databases and prediction tools. Generating high-confidence interaction networks is the first step towards the study of protein–protein interactions (PPI). A number of experimental methods based on distinct physical principles, such as the yeast two-hybrid method (Y2H), have been developed to identify PPI. In this work, we focus on one example of a biological network, namely the yeast protein interaction network (YPIN). For YPIN, we design and implement a computational model that captures the discrete and stochastic nature of protein interactions. In this model, we apply a spectrum analysis method to the variance of the protein nodes that play an important role in PPI networks, which can show the topological structure of the dynamic and collective performance of PPI networks. As an example, we take YPIN, including 48 "quasi-cliques" and 6 "quasi-bipartites" separated from 11855 yeast PPI networks with 2617 proteins, and apply spectrum analysis to show the topological structure and the performance of the dynamic and collective analysis of PPI networks. The obtained results may be valuable for deciphering unknown protein functions, determining protein complexes, and developing drugs. The PRIB 2008 proceedings are available at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  20. DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    • +1 more
    Updated Apr 14, 2022
    Cite
    Dai, Xinzhu; Pan, Zhixin; Jiang, Zhihang; Shao, Mengting; Liu, Dongmei (2022). DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus Erythematosus Based on Bioinformatics Analysis and Machine Learning.ZIP [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000226710
    Dataset updated
    Apr 14, 2022
    Authors
    Dai, Xinzhu; Pan, Zhixin; Jiang, Zhihang; Shao, Mengting; Liu, Dongmei
    Description

    Systemic lupus erythematosus (SLE) is a complex autoimmune disease that affects several organs and causes variable clinical symptoms. Gaining new insights into genetic factors may help reveal SLE etiology and improve the survival of SLE patients. The current study is designed to identify key genes involved in SLE and develop potential diagnostic biomarkers for SLE in clinical practice. Expression data of all genes of SLE and control samples in the GSE65391 and GSE72509 datasets were downloaded from the Gene Expression Omnibus (GEO) database. A total of 11 accurate differentially expressed genes (DEGs) were identified by the “limma” and “RobustRankAggreg” R packages. All these genes were functionally associated with several immune-related biological processes and a single KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway of necroptosis. The PPI analysis showed that IFI44, IFI44L, EIF2AK2, IFIT3, IFITM3, ZBP1, TRIM22, PRIC285, XAF1, and PARP9 could interact with each other. In addition, the expression patterns of these DEGs were found to be consistent in GSE39088. Moreover, receiver operating characteristic (ROC) curve analysis indicated that all these DEGs could serve as potential diagnostic biomarkers according to the area under the ROC curve (AUC) values. Furthermore, we constructed the transcription factor (TF)-diagnostic biomarker-microRNA (miRNA) network composed of 278 nodes and 405 edges, and a drug-diagnostic biomarker network consisting of 218 nodes and 459 edges. To investigate the relationship between diagnostic biomarkers and the immune system, we evaluated the immune infiltration landscape of SLE and control samples from GSE6539. Finally, using a variety of machine learning methods, IFI44 was determined to be the optimal diagnostic biomarker of SLE and then verified by quantitative real-time PCR (qRT-PCR) in an independent cohort. Our findings may benefit the diagnosis of patients with SLE and guide the development of novel targeted therapies for SLE patients.

