100+ datasets found

Bioinformatics repository examples with good practices of using GitHub.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated May 31, 2023
Cite
Yasset Perez-Riverol; Laurent Gatto; Rui Wang; Timo Sachsenberg; Julian Uszkoreit; Felipe da Veiga Leprevost; Christian Fufezan; Tobias Ternent; Stephen J. Eglen; Daniel S. Katz; Tom J. Pollard; Alexander Konovalov; Robert M. Flight; Kai Blin; Juan Antonio Vizcaíno (2023). Bioinformatics repository examples with good practices of using GitHub. [Dataset]. http://doi.org/10.1371/journal.pcbi.1004947.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1004947.t001
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yasset Perez-Riverol; Laurent Gatto; Rui Wang; Timo Sachsenberg; Julian Uszkoreit; Felipe da Veiga Leprevost; Christian Fufezan; Tobias Ternent; Stephen J. Eglen; Daniel S. Katz; Tom J. Pollard; Alexander Konovalov; Robert M. Flight; Kai Blin; Juan Antonio Vizcaíno
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/.
Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
+ more versions
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Explore at:
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
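
The generation steps above can be sketched roughly as follows. This is only an illustration, not the author's script: the dataset description says Biopython was used for the property calculations, whereas this sketch uses plain Python with the published Kyte-Doolittle hydropathy values so that it has no dependencies. The function names and the choice of which residues count as "polar" are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index for each of the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: one common polar/nonpolar grouping; the dataset's own
# grouping is not documented in the description.
POLAR = set("DEHKNQRSTY")


def random_sequence(rng, min_len=50, max_len=300):
    """Draw a uniformly random chain of 50 to 300 residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(seq):
    """Average Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def polar_proportion(seq):
    """Percentage of residues that fall in the POLAR set."""
    return 100.0 * sum(aa in POLAR for aa in seq) / len(seq)


def describe(seq):
    """Build one row using a subset of the column names listed above."""
    polar = polar_proportion(seq)
    return {
        "Sequence": seq,
        "Sequence_Length": len(seq),
        "Hydrophobicity": mean_hydrophobicity(seq),
        "Polar_Proportion": polar,
        "Nonpolar_Proportion": 100.0 - polar,
    }


row = describe(random_sequence(random.Random(0)))
print(row["Sequence_Length"], round(row["Hydrophobicity"], 3))
```

Because sequences and classes are both drawn at random, a classifier trained on such rows should not be expected to beat chance, which is consistent with the limitations noted in the description.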

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Data from: Advancing computational biology and bioinformatics research...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Sep 27, 2019
Cite
Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind (2019). Advancing computational biology and bioinformatics research through open innovation competitions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000064443
Dataset updated
Sep 27, 2019
Authors
Jonchhe, Anup; Su, Andrew I.; Natoli, Ted; Macaluso, N. J. Maximilian; Briney, Bryan; Blasco, Andrea; Narayan, Rajiv; Lakhani, Karim R.; Paik, Jin H.; Endres, Michael G.; Sergeev, Rinat A.; Wu, Chunlei; Subramanian, Aravind
Description
Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research in which the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research.
MINUTE-ChIP example data
figshare.scilifelab.se
txt
Updated Jan 15, 2025
+ more versions
Cite
Carmen Navarro Luzon; Simon Elsässer (2025). MINUTE-ChIP example data [Dataset]. http://doi.org/10.17044/scilifelab.25348405.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.17044/scilifelab.25348405.v1
Dataset updated
Jan 15, 2025
Dataset provided by
Karolinska Institutet
Authors
Carmen Navarro Luzon; Simon Elsässer
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
This collection contains an example MINUTE-ChIP dataset on which to run the minute pipeline. It is provided as supporting material to help users understand the results of a MINUTE-ChIP experiment, from raw data to a primary analysis that yields the relevant files for downstream analysis along with summarized QC indicators. The example primary non-demultiplexed FASTQ files provided here were used with the minute pipeline to generate GSM5493452-GSM5493463 (H3K27me3) and GSM5823907-GSM5823918 (Input), all deposited on GEO under series GSE181241. For more information about MINUTE-ChIP, see the publication relevant to this dataset: Kumar, Banushree, et al. "Polycomb repressive complex 2 shields naïve human pluripotent cells from trophectoderm differentiation." Nature Cell Biology 24.6 (2022): 845-857. For more information about the minute pipeline, there is a public bioRxiv preprint, a GitHub repository, and official documentation.
Data from: DNA metabarcoding captures subtle differences in forest beetle...
researchdata.edu.au
Updated 2020
Cite
Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu (2020). DNA metabarcoding captures subtle differences in forest beetle communities following disturbance [Dataset]. https://researchdata.edu.au/dna-metabarcoding-captures-following-disturbance/1676001
Dataset updated
2020
Dataset provided by
University of Tasmania, Australia
Authors
Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset includes all raw Miseq high-throughput sequencing data, bioinformatic pipeline and R codes that were used in the publication "Liu M, Baker SC, Burridge CP, Jordan GJ, Clarke LJ (2020) DNA metabarcoding captures subtle differences in forest beetle communities following disturbance. Restoration Ecology. 28:1475-1484. DOI:10.1111/rec.13236."

- Miseq_16S.zip - Miseq sequencing dataset for gene marker 16S, including 48 fastq files for 24 beetle bulk samples.
- Miseq_CO1.zip - Miseq sequencing dataset for gene marker CO1, including 46 fastq files for 23 beetle bulk samples (one sample failed to be sequenced).
- nfp4MBC.nf - A Nextflow bioinformatic script to process Miseq datasets.
- nextflow.config - A configuration file needed when using nfp4MBC.nf.
- adapters_16S.zip - Adapters used to tag each of 24 beetle bulk samples for 16S, also used to process the 16S Miseq dataset when using nfp4MBC.nf.
- adapters_CO1.zip - Adapters used to tag each of 24 beetle bulk samples for CO1, also used to process the CO1 Miseq dataset when using nfp4MBC.nf.
- rMBC.Rmd - R Markdown code for community analyses.
- rMBC.zip - Datasets used in rMBC.Rmd.
- COI_ZOTUs_176.fasta - DNA sequences of 176 COI ZOTUs.
- 16S_ZOTUs_156 - DNA sequences of 156 16S ZOTUs.
Making toast: Using analogies to explore concepts in bioinformatics
qubeshub.org
Updated Aug 26, 2021
Cite
Kate Hertweck (2021). Making toast: Using analogies to explore concepts in bioinformatics [Dataset]. http://doi.org/10.24918/cs.2016.11
Unique identifier
https://doi.org/10.24918/cs.2016.11
Dataset updated
Aug 26, 2021
Dataset provided by
QUBES
Authors
Kate Hertweck
Description
Contemporary biology is moving towards heavy reliance on computational methods to manage, find patterns, and derive meaning from large-scale data, such as genomic sequences. Biology teachers are increasingly compelled to prepare students with skills to meet these challenges. However, introducing biology students to more abstract concepts associated with computational thinking remains a major challenge. Analogies have long been used in science classrooms to help students comprehend complex concepts by relating them to familiar processes. Here I present a multi-step procedure for introducing students to large-scale data analysis (bioinformatics workflows) by asking them to describe a common daily task: making toast. First, students describe the main steps associated with this procedure. Next, students are presented with alternative scenarios for materials and equipment and are asked to extend the analogy to accommodate them. Finally, students are led through examples of how the analogy breaks down, or fails to accurately represent, a bioinformatics analysis. This structured approach to student exploration of analogies related to computational biology capitalizes on diverse student experiences to both clarify concepts and ameliorate possible misconceptions. Similar methods can be used to introduce many abstract concepts in both biology and computer science.
INSDC Environment Sample Sequences
gbif.org
demo.gbif.org
+1more
Updated Nov 29, 2025
Cite
European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
Unique identifier
https://doi.org/10.15468/mcmd5g
Dataset updated
Nov 29, 2025
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
European Bioinformatics Institute (http://www.ebi.ac.uk/)
Authors
European Bioinformatics Institute (EMBL-EBI)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or groups of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers are aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
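
The duplicate-filtering rule in step 6 can be sketched as below. This is a minimal illustration, not the code from the embl-adapter repository: it assumes each record is a plain dictionary keyed by the field names given in step 6, and the function name `drop_oversized_groups` is hypothetical. Only the grouping fields and the threshold of 50 come from the description.

```python
from collections import defaultdict

# Grouping fields listed in step 6 of the description.
GROUP_KEYS = (
    "scientific_name",
    "collection_date",
    "location",
    "country",
    "identified_by",
    "collected_by",
    "sample_accession",  # may be missing, in which case it groups as None
)

MAX_GROUP_SIZE = 50  # groups with more than 50 records are excluded


def drop_oversized_groups(records, max_size=MAX_GROUP_SIZE):
    """Group records on GROUP_KEYS and drop every group larger than max_size."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in GROUP_KEYS)
        groups[key].append(record)

    kept = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept
```

Note that the rule removes an oversized group entirely rather than collapsing it to a single record, which is how the description words it.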
Examples datasets for Microbiology
zenodo.org
zip
Updated Jan 24, 2020
+ more versions
Cite
Tristan Cordier (2020). Examples datasets for Microbiology [Dataset]. http://doi.org/10.5281/zenodo.2605445
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.2605445
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Tristan Cordier
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Three example datasets for performing bioinformatics analysis.
temporary examples
figshare.com
xlsx
Updated Dec 15, 2018
Cite
Ryan Dale (2018). temporary examples [Dataset]. http://doi.org/10.6084/m9.figshare.7470083.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.7470083.v1
Dataset updated
Dec 15, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ryan Dale
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example files to test URL handling
RMQS1 16S bioinformatic config files and control sample data
entrepot.recherche.data.gouv.fr
application/gzip, tsv +1
Updated Aug 22, 2024
+ more versions
Cite
Sébastien Terrat; Samuel Dequiedt (2024). RMQS1 16S bioinformatic config files and control sample data [Dataset]. http://doi.org/10.57745/XBFOJP
Explore at:
Available download formats: tsv(522347), txt(143493), tsv(8814), tsv(33093), tsv(117004), application/gzip(362535), tsv(13212), tsv(32344), tsv(266094), tsv(80032), txt(10413), tsv(16460)
Unique identifier
https://doi.org/10.57745/XBFOJP
Dataset updated
Aug 22, 2024
Dataset provided by
Recherche Data Gouv
Authors
Sébastien Terrat; Samuel Dequiedt
License
https://spdx.org/licenses/etalab-2.0.html
Dataset funded by
French National Research Agency (ANR)
France Génomique
French Agency for Ecological Transition (ADEME)
Description
RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.

Dataset: This dataset contains config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance file of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.

File structure:

Taxonomy files rmqs1_control_taxonomy_: Taxonomy is split across five files with one line per site and one column per taxon. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present:
- Unknown: not matching any reference.
- Unclassified: missing taxa between genus and phylum.
- Environmental: matched to sample from environmental study, generally with only a phylum name.

rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTU, “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold).

rmqs1_16S_bank_association.tsv: two-column file with the bank name for each sample.

rmqs1_16S_bank_metadata.tsv:
- library_name: library name used in labs
- study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier
- library_name_genoscope: library name used in the Genoscope sequence center
- MID: multiplex identifier sequence
- run_alias: Genoscope internal alias
- ftp_link: FTP link to download library

Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from raw libraries detailed in rmqs1_16S_bank_metadata.tsv.

project_G4.tab: Comma separated file containing the information needed to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only:
- PROJECT: Project name chosen by the user
- LIBRARY_NAME: Library name chosen by the user
- LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
- SAMPLE_NAME: Sample name chosen by the user
- MID_F: MID name or MID sequence associated to the Forward primer
- MID_R: MID name or MID sequence associated to the Reverse primer
- TARGET: Target gene (16S, 18S, or 23S)
- PRIMER_F: Forward primer name used for amplification
- PRIMER_R: Reverse primer name used for amplification
- SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
- SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification

Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from raw libraries detailed in rmqs1_16S_bank_metadata.tsv.

project_GLOBAL.tab: Comma separated file containing the information needed to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline:
- PROJECT: Project name chosen by the user
- LIBRARY_NAME: Library name chosen by the user
- LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
- SAMPLE_NAME: Sample name chosen by the user
- MID_F: MID name or MID sequence associated to the Forward primer
- MID_R: MID name or MID sequence associated to the Reverse primer
- TARGET: Target gene (16S, 18S, or 23S)
- PRIMER_F: Forward primer name used for amplification
- PRIMER_R: Reverse primer name used for amplification
- SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
- SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification

Details: Data for three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples can be present in several libraries; we kept only the one with the highest number of sequences.
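
The statement that each line of the abundance and taxonomy files sums to 10k (the rarefaction threshold) suggests a simple sanity check after download. The sketch below is an assumption-laden illustration: it supposes a tab-separated layout with one header row, a site identifier in the first column, and integer counts in all remaining columns, which the description does not spell out. The function name `rows_off_depth` is hypothetical.

```python
import csv
import io

RAREFACTION_DEPTH = 10_000  # "10k" in the dataset description


def rows_off_depth(tsv_text, depth=RAREFACTION_DEPTH):
    """Return the site identifiers whose counts do not sum to the depth.

    Assumed layout: header row, then one row per site with the site id
    in the first column and integer counts in the other columns.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row
    off = []
    for row in reader:
        if not row:
            continue  # tolerate blank trailing lines
        site, counts = row[0], row[1:]
        if sum(int(value) for value in counts) != depth:
            off.append(site)
    return off
```

In practice the text would come from reading a file such as rmqs1_16S_otu_abundance.tsv; an empty result means every site matches the stated depth under the assumed layout.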
Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Dec 11, 2014
Cite
Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng (2014). “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001209841
Dataset updated
Dec 11, 2014
Authors
Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng
Description
“Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education
Sample DNA Sequence
kaggle.com
zip
Updated Jan 14, 2021
Cite
Sreshta Putchala (2021). Sample DNA Sequence [Dataset]. https://www.kaggle.com/sreshta140/covid19-genome-sequence
Explore at:
Available download formats: zip (69652 bytes)
Dataset updated
Jan 14, 2021
Authors
Sreshta Putchala
Description
Dataset

This dataset was created by Sreshta Putchala

Contents
Bakta Annotation Examples
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Nov 10, 2021
Cite
Oliver Schwengers (2021). Bakta Annotation Examples [Dataset]. http://doi.org/10.5281/zenodo.4922840
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4922840
Dataset updated
Nov 10, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Oliver Schwengers
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data repository provides exemplary bacterial genome annotations conducted with Bakta of a broad taxonomical range of genomes comprising many pathogens (all ESKAPE), commensals and environmental species.

Bakta is a tool for the rapid & standardized local annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis: https://github.com/oschwengers/bakta
Example File 1.txt
figshare.com
txt
Updated Apr 27, 2020
+ more versions
Cite
Vafaee Lab (2020). Example File 1.txt [Dataset]. http://doi.org/10.6084/m9.figshare.12200138.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12200138.v1
Dataset updated
Apr 27, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vafaee Lab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example files to run DrugSimDB interface
metabarcoding data for: Benchmark of bioinformatics tools for fast and...
search.dataone.org
dataone.org
+1more
Updated May 17, 2025
Cite
Laetitia Mathon (2025). metabarcoding data for: Benchmark of bioinformatics tools for fast and accurate species identification from environmental DNA metabarcoding [Dataset]. http://doi.org/10.5061/dryad.15dv41nx6
Unique identifier
https://doi.org/10.5061/dryad.15dv41nx6
Dataset updated
May 17, 2025
Dataset provided by
Dryad Digital Repository
Authors
Laetitia Mathon
Time period covered
Jan 1, 2021
Description
This dataset contains fish DNA sequence samples, simulated with Grinder to build a mock community, as well as real fish eDNA metabarcoding data from the Mediterranean Sea.

These data have been used to compare the efficiency of different bioinformatic tools in retrieving the species composition of real and simulated samples.
18s_SSU identified sample library
researchdata.edu.au
bridges.monash.edu
Updated May 5, 2022
+ more versions
Cite
Cathy Cavallo (2022). 18s_SSU identified sample library [Dataset]. http://doi.org/10.26180/5ea7d9b786c4e
Unique identifier
https://doi.org/10.26180/5ea7d9b786c4e
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Cathy Cavallo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the identified sample library containing little penguin faecal samples with numbers of sequence reads for each taxon identified.
Data from: Consensus clustering of gene expression microarray data using...
bridges.monash.edu
researchdata.edu.au
pdf
Updated Nov 21, 2017
Cite
Mendes, Alexandre (2017). Consensus clustering of gene expression microarray data using genetic algorithms [Dataset]. http://doi.org/10.4225/03/5a13728358b1d
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.4225/03/5a13728358b1d
Dataset updated
Nov 21, 2017
Dataset provided by
Monash University
Authors
Mendes, Alexandre
License
http://rightsstatements.org/vocab/InC/1.0/
Description
This work presents a new consensus clustering method for gene expression microarray data based on a genetic algorithm. Using two datasets - DA and DB - as input, the genetic algorithm examines putative partitions for the samples in DA, selecting biomarkers that support such partitions. The biomarkers are then used to build a classifier, which is applied to DB to determine the classes of its samples. The genetic algorithm is guided by an objective function that takes into account the accuracy of classification in both datasets, the number of biomarkers that support the partition, and the distribution of the samples across the classes for each dataset. To illustrate the method, two whole-genome breast cancer instances from different sources were used. In this application, the results indicate that the method could be used to find unknown subtypes of diseases supported by biomarkers presenting similar gene expression profiles across platforms. Moreover, even though this initial study was restricted to two datasets and two classes, the method can be easily extended to consider both more datasets and classes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Data Sheet 1_Comprehensive bioinformatics analysis identifies metabolic and...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jan 8, 2025
Cite
Liang, Qianqian; Wang, Yide; Li, Zheng (2025). Data Sheet 1_Comprehensive bioinformatics analysis identifies metabolic and immune-related diagnostic biomarkers shared between diabetes and COPD using multi-omics and machine learning.zip [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001289675
Dataset updated
Jan 8, 2025
Authors
Liang, Qianqian; Wang, Yide; Li, Zheng
Description
Background: Diabetes and chronic obstructive pulmonary disease (COPD) are prominent global health challenges, each imposing significant burdens on affected individuals, healthcare systems, and society. However, the specific molecular mechanisms supporting their interrelationship have not been fully defined.

Methods: We identified the differentially expressed genes (DEGs) of COPD and diabetes from multi-center patient cohorts, respectively. Through cross-analysis, we identified the shared DEGs of COPD and diabetes, and investigated alterations of signaling pathways using Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and gene set enrichment analysis (GSEA). By using weighted gene correlation network analysis (WGCNA), key gene modules for COPD and diabetes were identified, and various machine learning algorithms were employed to identify shared biomarkers. Using xCell, we investigated the relationship between shared biomarkers and immune infiltration in diabetes and COPD. Single-cell sequencing, clinical samples, and animal models were used to confirm the robustness of shared biomarkers.

Results: Cross-analysis identified 186 shared DEGs between diabetes and COPD patients. Functional enrichment results demonstrate that metabolic and immune-related pathways are common features altered in both diabetes and COPD patients. WGCNA identified 526 genes from key gene modules in COPD and diabetes. Multiple machine learning algorithms identified 4 shared biomarkers for COPD and diabetes, including CADPS, EDNRB, THBS4 and TMEM27. Finally, the 4 shared biomarkers were validated in single-cell sequencing data, clinical samples, and animal models, and their expression changes were consistent with the results of bioinformatic analysis.

Conclusions: Through comprehensive bioinformatics analysis, we revealed the potential connection between diabetes and COPD, providing a theoretical basis for exploring the common regulatory genes.
Data from: Spectrum analysis based method for dynamics and collective...
researchdata.edu.au
bridges.monash.edu
Updated May 5, 2022
Cite
Yi-Zhen Shen; Yong-Sheng Ding; Quan Gu (2022). Spectrum analysis based method for dynamics and collective analysis of protein-protein interaction networks [Dataset]. http://doi.org/10.4225/03/5a13725619374
Unique identifier
https://doi.org/10.4225/03/5a13725619374
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Yi-Zhen Shen; Yong-Sheng Ding; Quan Gu
Description
The importance of understanding biological interaction networks has fueled the development of numerous interaction data generation techniques, databases, and prediction tools. Generating high-confidence interaction networks is the first step toward studying protein–protein interactions (PPI). A number of experimental methods based on distinct physical principles, such as the yeast two-hybrid method (Y2H), have been developed to identify PPI. In this work, we focus on one example of a biological network, namely the yeast protein interaction network (YPIN). For YPIN, we design and implement a computational model that captures the discrete and stochastic nature of protein interactions. In this model, we apply a spectrum analysis method to the variance of the protein nodes that play an important role in the PPI networks, which can show the topological structure of the dynamic and collective performance of PPI networks. As an example, we take 48 "quasi-cliques" and 6 "quasi-bipartites" separated from 11855 yeast PPI networks with 2617 proteins and apply spectrum analysis to show the topological structure and performance of the dynamic and collective analysis of PPI networks. The obtained results may be valuable for deciphering unknown protein functions, determining protein complexes, and inventing drugs. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology; Chetty, Madhu; Ahmad, Shandar; Ngom, Alioune; Teng, Shyh Wei; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia)
Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus...
datasetcatalog.nlm.nih.gov
figshare.com
+1 more
Updated Apr 14, 2022
Cite
Dai, Xinzhu; Pan, Zhixin; Jiang, Zhihang; Shao, Mengting; Liu, Dongmei (2022). DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus Erythematosus Based on Bioinformatics Analysis and Machine Learning.ZIP [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000226710
Dataset updated
Apr 14, 2022
Authors
Dai, Xinzhu; Pan, Zhixin; Jiang, Zhihang; Shao, Mengting; Liu, Dongmei
Description
Systemic lupus erythematosus (SLE) is a complex autoimmune disease that affects several organs and causes variable clinical symptoms. Exploring new insights on genetic factors may help reveal SLE etiology and improve the survival of SLE patients. The current study is designed to identify key genes involved in SLE and develop potential diagnostic biomarkers for SLE in clinical practice. Expression data of all genes of SLE and control samples in the GSE65391 and GSE72509 datasets were downloaded from the Gene Expression Omnibus (GEO) database. A total of 11 accurate differentially expressed genes (DEGs) were identified by the “limma” and “RobustRankAggreg” R packages. All these genes were functionally associated with several immune-related biological processes and a single KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway of necroptosis. The PPI analysis showed that IFI44, IFI44L, EIF2AK2, IFIT3, IFITM3, ZBP1, TRIM22, PRIC285, XAF1, and PARP9 could interact with each other. In addition, the expression patterns of these DEGs were found to be consistent in GSE39088. Moreover, receiver operating characteristic (ROC) curve analysis indicated that all these DEGs could serve as potential diagnostic biomarkers according to the area under the ROC curve (AUC) values. Furthermore, we constructed a transcription factor (TF)-diagnostic biomarker-microRNA (miRNA) network composed of 278 nodes and 405 edges, and a drug-diagnostic biomarker network consisting of 218 nodes and 459 edges. To investigate the relationship between diagnostic biomarkers and the immune system, we evaluated the immune infiltration landscape of SLE and control samples from GSE6539. Finally, using a variety of machine learning methods, IFI44 was determined to be the optimal diagnostic biomarker of SLE and then verified by quantitative real-time PCR (qRT-PCR) in an independent cohort.
Our findings may benefit the diagnosis of patients with SLE and guide the development of novel targeted therapies for SLE patients.

