100+ datasets found

Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Explore at:
zip(12928905 bytes)Available download formats
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as: - UniProt: A comprehensive database of protein sequences and annotations. - Kyte-Doolittle Scale: Calculations of hydrophobicity. - Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for: - Training classification models for proteins. - Exploratory analysis of physicochemical properties of proteins. - Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets: - Training: 16,000 samples (proteinas_train.csv). - Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
s
MINUTE-ChIP example data
figshare.scilifelab.se
txt
Updated Jan 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carmen Navarro Luzon; Simon Elsässer (2025). MINUTE-ChIP example data [Dataset]. http://doi.org/10.17044/scilifelab.25348405.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.17044/scilifelab.25348405.v1
Dataset updated
Jan 15, 2025
Dataset provided by
Karolinska Institutet
Authors
Carmen Navarro Luzon; Simon Elsässer
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
This collection contains an example MINUTE-ChIP dataset to run minute pipeline on, provided as supporting material to help users understand the results of a MINUTE-ChIP experiment from raw data to a primary analysis that yields the relevant files for downstream analysis along with summarized QC indicators. Example primary non-demultiplexed FASTQ files provided here were used to generate GSM5493452-GSM5493463 (H3K27m3) and GSM5823907-GSM5823918 (Input), deposited on GEO with the minute pipeline all together under series GSE181241. For more information about MINUTE-ChIP, you can check the publication relevant to this dataset: Kumar, Banushree, et al. "Polycomb repressive complex 2 shields naïve human pluripotent cells from trophectoderm differentiation." Nature Cell Biology 24.6 (2022): 845-857. If you want more information about the minute pipeline, there is a public biorXiv and a GitHub repository and official documentation.
Example dataset annotated with bacannot
figshare.com
application/x-gzip
Updated Jun 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Felipe Almeida (2023). Example dataset annotated with bacannot [Dataset]. http://doi.org/10.6084/m9.figshare.14160590.v1
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14160590.v1
Dataset updated
Jun 11, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Felipe Almeida
License
https://www.gnu.org/licenses/gpl-3.0.htmlhttps://www.gnu.org/licenses/gpl-3.0.html
Description
This dataset represents the https://github.com/PacificBiosciences/DevNet/wiki/8-plex-Ecoli-Multiplexed-Microbial-Assembly pacbio dataset already assembled and annotated so that users that want to skip some steps of the tutorial can do it by downloading this dataset.
DNA Classification dataset
kaggle.com
Updated Jul 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arif Miah (2025). DNA Classification dataset [Dataset]. https://www.kaggle.com/datasets/miadul/dna-classification-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 29, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Arif Miah
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains 3,000 synthetic DNA samples with 13 features designed for genomic data analysis, machine learning, and bioinformatics research. Each row represents a unique DNA sample with both sequence-level and statistical attributes.

🔹 Dataset Structure

Rows: 3,000

Columns: 13

🔹 Features Description

Sample_ID → Unique identifier for each DNA sample

Sequence → DNA sequence (string of A, T, C, G)

GC_Content → Percentage of Guanine (G) and Cytosine (C) in the sequence

AT_Content → Percentage of Adenine (A) and Thymine (T) in the sequence

Sequence_Length → Total sequence length

Num_A → Number of Adenine bases

Num_T → Number of Thymine bases

Num_C → Number of Cytosine bases

Num_G → Number of Guanine bases

kmer_3_freq → Average 3-mer (triplet) frequency score

Mutation_Flag → Binary flag indicating mutation presence (0 = No, 1 = Yes)

Class_Label → Class of the sample (Human, Bacteria, Virus, Plant)

Disease_Risk → Risk level associated with the sample (Low / Medium / High)

🔹 Potential Use Cases

DNA classification tasks (e.g., predicting species from DNA sequence features)

Exploratory Data Analysis (EDA) in bioinformatics

Machine Learning model development (Logistic Regression, Random Forest, SVM, Neural Networks)

Deep Learning approaches (LSTM, CNN, Transformers for sequence learning)

Mutation detection and disease risk analysis

Teaching and practicing biological data preprocessing techniques

🔹 Why This Dataset?

Synthetic but realistic structure, inspired by genomics data

Balanced and diverse distribution of features and labels

Suitable for beginners and researchers to practice classification, visualization, and model comparison

🔹 Example Research Questions

Can we classify DNA samples into their biological class using sequence-based features?

How does GC content relate to mutation risk?

Which ML model performs best for DNA classification tasks?

Can synthetic DNA features predict disease risk categories?

📌 Acknowledgment

This dataset is synthetic and generated for educational & research purposes. It does not represent real patient data.
INSDC Environment Sample Sequences
gbif.org
demo.gbif.org
+1more
Updated Nov 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI) (2025). INSDC Environment Sample Sequences [Dataset]. http://doi.org/10.15468/mcmd5g
Explore at:
Unique identifier
https://doi.org/10.15468/mcmd5g
Dataset updated
Nov 29, 2025
Dataset provided by
Global Biodiversity Information Facilityhttps://www.gbif.org/
European Bioinformatics Institutehttp://www.ebi.ac.uk/
Authors
European Bioinformatics Institute (EMBL-EBI); European Bioinformatics Institute (EMBL-EBI)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organism of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers are aggregated together.
6. A lot of records left corresponded to individual sequences or reads corresponding to the same organisms. In practise, these were "duplicate" occurrence records that weren't filtered out in STEP 2 because the sample accession sample was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, Phylum, Class, Order, Family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
r
Data from: DNA metabarcoding captures subtle differences in forest beetle...
researchdata.edu.au
Updated 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu; Susan Baker; Mingxin Liu; Mingxin Liu; Laurence Clarke; Greg Jordan; Christopher Burridge (2020). DNA metabarcoding captures subtle differences in forest beetle communities following disturbance [Dataset]. https://researchdata.edu.au/dna-metabarcoding-captures-following-disturbance/1676001
Explore at:
Dataset updated
2020
Dataset provided by
University of Tasmania, Australia
Authors
Susan Baker; Laurence Clarke; Christopher Burridge; Greg Jordan; Mingxin Liu; Susan Baker; Mingxin Liu; Mingxin Liu; Laurence Clarke; Greg Jordan; Christopher Burridge
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset includes all raw Miseq high-throughput sequencing data, bioinformatic pipeline and R codes that were used in the publication "Liu M, Baker SC, Burridge CP, Jordan GJ, Clarke LJ (2020) DNA metabarcoding captures subtle differences in forest beetle communities following disturbance. Restoration Ecology. 28:1475-1484. DOI:10.1111/rec.13236."

Miseq_16S.zip - Miseq sequencing dataset for gene marker 16S, including 48 fastq files for 24 beetle bulk samples; Miseq_CO1.zip -Miseq sequencing dataset for gene marker CO1, including 46 fastq files for 23 beetle bulk samples (one sample failed to be sequenced); nfp4MBC.nf - A nextflow bioinformatic script to process Miseq datasets; nextflow.config - A configuratioin file needed when using nfp4MBC.nf; adapters_16S.zip - Adapters used to tag each of 24 beetle bulk samples for 16S, also used to process 16S Miseq dataset when using nfp4MBC.nf; adapters_CO1.zip - Adapters used to tag each of 24 beetle bulk samples for CO1, also used to process CO1 Miseq dataset when using nfp4MBC.nf; rMBC.Rmd - R markdown codes for community analyses; rMBC.zip - Datasets used in rMBC.Rmd. COI_ZOTUs_176.fasta - DNA sequences of 176 COI ZOTUs. 16S_ZOTUs_156 -DNA sequences of 156 16S ZOTUs.
Examples datasets for Microbiology
zenodo.org
zip
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tristan Cordier; Tristan Cordier (2020). Examples datasets for Microbiology [Dataset]. http://doi.org/10.5281/zenodo.2605445
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.2605445
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Tristan Cordier; Tristan Cordier
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Three examples dataset to perform bioinformatics analysis.
Sample dataset for OnClass
figshare.com
zip
Updated Dec 11, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sheng wang (2019). Sample dataset for OnClass [Dataset]. http://doi.org/10.6084/m9.figshare.11340965.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.11340965.v1
Dataset updated
Dec 11, 2019
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
sheng wang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for OnClass
Metagenomes example datasets
figshare.com
application/gzip
Updated Jan 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kelly Hidalgo (2022). Metagenomes example datasets [Dataset]. http://doi.org/10.6084/m9.figshare.19015058.v1
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19015058.v1
Dataset updated
Jan 24, 2022
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Kelly Hidalgo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example dataset for bioinformatic tutorials
Example datasets for AlphaCRV
zenodo.org
application/gzip, bin
Updated Jan 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francisco J. Guzmán-Vega; Francisco J. Guzmán-Vega (2024). Example datasets for AlphaCRV [Dataset]. http://doi.org/10.5281/zenodo.10470744
Explore at:
bin, application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10470744
Dataset updated
Jan 8, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Francisco J. Guzmán-Vega; Francisco J. Guzmán-Vega
License
https://www.gnu.org/licenses/gpl-3.0-standalone.htmlhttps://www.gnu.org/licenses/gpl-3.0-standalone.html
Time period covered
Jan 8, 2024
Description
Example datasets for AlphaCRV: A Pipeline for Identifying Accurate Binder Topologies in Mass-Modeling with AlphaFold. With these datasets you can replicate two of the three examples shown in the paper, following along with the Jupyter notebooks in the GitHub repository at https://github.com/strubelab/AlphaCRV. Description:

AVRPia.fasta, AVRPik.fasta, SKP1.fasta: Sequences of the bait molecules for the three examples.

AVRPia_binders.fasta, AVRPik_binders.fasta, SKP1_binders.fasta: Sequences of the binder molecules for the three examples.

AVRPia_vs_rice_models.tar.lzma, AVRPik_vs_rice_models.tar.lzma: Compressed archives of the AlphaFold-Multimer models for the AVRPia and AVRPik examples. The 712 complexes for SKP1 are more than 100GB in size, so those can be provided upon request. To decompress the .tar.lzma archives use the following two commands on each:

unlzma AVRPia_vs_rice_models.tar.lzma tar -xvf AVRPia_vs_rice_models.tar

AVRPia_vs_rice_clusters.tar.gz, AVRPik_vs_rice_clusters.tar.gz, SKP1_vs_rice_clusters.tar.gz: Compressed archives with the results from running AlphaCRV on the three examples presented in the paper. To decompress these archives use the following command on each:

tar -xvzf AVRPia_vs_rice_clusters.tar.gz
f
Bioinformatics repository examples with good practices of using GitHub.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Oct 13, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pollard, Tom J.; da Veiga Leprevost, Felipe; Vizcaíno, Juan Antonio; Flight, Robert M.; Eglen, Stephen J.; Ternent, Tobias; Blin, Kai; Perez-Riverol, Yasset; Gatto, Laurent; Konovalov, Alexander; Wang, Rui; Uszkoreit, Julian; Sachsenberg, Timo; Fufezan, Christian; Katz, Daniel S. (2016). Bioinformatics repository examples with good practices of using GitHub. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001509925
Explore at:
Dataset updated
Oct 13, 2016
Authors
Pollard, Tom J.; da Veiga Leprevost, Felipe; Vizcaíno, Juan Antonio; Flight, Robert M.; Eglen, Stephen J.; Ternent, Tobias; Blin, Kai; Perez-Riverol, Yasset; Gatto, Laurent; Konovalov, Alexander; Wang, Rui; Uszkoreit, Julian; Sachsenberg, Timo; Fufezan, Christian; Katz, Daniel S.
Description
The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/.
DNA sequencing raw data and analytical results by bioinformatics for column...
catalog.data.gov
datasets.ai
+2more
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). DNA sequencing raw data and analytical results by bioinformatics for column study on algal roganic matter impact. [Dataset]. https://catalog.data.gov/dataset/dna-sequencing-raw-data-and-analytical-results-by-bioinformatics-for-column-study-on-algal
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The excel spreadsheet includes sample IDs and labeling information for DNA sequencing raw data. In addition, DNA concentrations for all the biofilm samples analyzed are presented. This dataset is associated with the following publication: Jeon, Y., l. li, J. Calvillo, H. Ryu, J. Santo Domingo, O. Choi, J. Brown, and Y. Seo. Impact of algal organic matter on the performance, cyanotoxin removal, and biofilms of biologically-active filtration systems. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 184: 116120, (2020).
R
RMQS1 16S bioinformatic config files and control sample data
entrepot.recherche.data.gouv.fr
application/gzip, tsv +1
Updated Aug 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sébastien Terrat; Sébastien Terrat; Samuel Dequiedt; Samuel Dequiedt (2024). RMQS1 16S bioinformatic config files and control sample data [Dataset]. http://doi.org/10.57745/XBFOJP
Explore at:
tsv(522347), txt(143493), tsv(8814), tsv(33093), tsv(117004), application/gzip(362535), tsv(13212), tsv(32344), tsv(266094), tsv(80032), txt(10413), tsv(16460)Available download formats
Unique identifier
https://doi.org/10.57745/XBFOJP
Dataset updated
Aug 22, 2024
Dataset provided by
Recherche Data Gouv
Authors
Sébastien Terrat; Sébastien Terrat; Samuel Dequiedt; Samuel Dequiedt
License
https://spdx.org/licenses/etalab-2.0.htmlhttps://spdx.org/licenses/etalab-2.0.html
Dataset funded by
French National Research Agency (ANR)
France Génomique
French Agency for Ecological Transition (ADEME)
Description
RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009. Dataset: This dataset contains config files used to run the bioinformatic pipeline and the control sample data that were not published before Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance file of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information. File structure: Taxonomy files rmqs1_control_taxonomy_: Taxonomy is splitted across five files with one line per site and one column per taxa. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present: Unknown: not matching any reference. Unclassified: missing taxa between genus and phylum. Environmental: matched to sample from environmental study, generally with only a phylum name. rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTUs, “DB” + number for OTUs from RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold). rmqs1_16S_bank_association.tsv: two columns file with bank name for each sample rmqs1_16S_bank_metadata.tsv: library_name: library name used in labs study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier library_name_genoscope: library name used in the Genoscope sequence center MID: multiplex identifier sequence run_alias: Genoscope internal alias ftp_link: FTP link to download library Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from raw librairies detailed in the rmqs1_16S_bank_metadata.tsv. project_G4.tab: Comma separated file containing the needed information to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only: PROJECT: Project name chosen by the user LIBRARY_NAME: Library name chosen by the user LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE SAMPLE_NAME: Sample name chosen by the user MID_F: MID name or MID sequence associated to the Forward primer MID_R: MID name or MID sequence associated to the Reverse primer TARGET: Target gene (16S, 18S, or 23S) PRIMER_F: Forward primer name used for amplification PRIMER_R: Reverse primer name used for amplification SEQUENCE_PRIMER_F: Forward primer sequence used for amplification SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from raw librairies detailed in the rmqs1_16S_bank_metadata.tsv. project_GLOBAL.tab: Comma separated file containing the needed information to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline: PROJECT: Project name chosen by the user LIBRARY_NAME: Library name chosen by the user LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE SAMPLE_NAME: Sample name chosen by the user MID_F: MID name or MID sequence associated to the Forward primer MID_R: MID name or MID sequence associated to the Reverse primer TARGET: Target gene (16S, 18S, or 23S) PRIMER_F: Forward primer name used for amplification PRIMER_R: Reverse primer name used for amplification SEQUENCE_PRIMER_F: Forward primer sequence used for amplification SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification Details: Three libraries (58,59 and 69) data were re-sequenced and are not detailed in files. Some samples can be present in several libraries. We kept only the one with the highest number of sequences.
Ensembl TSS dataset for GRCh38
zenodo.org
portalcienciaytecnologia.jcyl.es
+2more
bin
Updated Aug 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
José A. Barbero-Aparicio; José A. Barbero-Aparicio; Alicia Olivares-Gil; Alicia Olivares-Gil; José F. Díez-Pastor; José F. Díez-Pastor; César García-Osorio; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7147597
Dataset updated
Aug 26, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
José A. Barbero-Aparicio; José A. Barbero-Aparicio; Alicia Olivares-Gil; Alicia Olivares-Gil; José F. Díez-Pastor; José F. Díez-Pastor; César García-Osorio; César García-Osorio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensemble at the moment. However, the DNA sequence by itself is not enough, the specific TSS position of each transcript is needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.

First, we need an interface in order to download the raw data, which is composed by every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks will form our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, 700 bases before this codon and 300 bases after it are concatenated, getting the final sequence of 1003 nucleotides that is going to be used in our models. These specific window values have been used in (Bhandari et al., 2021) and we have kept them as we find it interesting for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. In order to get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected the specific position, we get 700 bases ahead and 300 bases after it as we did with the positive instances.

Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang135
et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this136
idea, we select 10 random positions from the transcript sequence of each positive codon and label them137
as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
Z
Scorpio Gene-Taxa Benchmark Dataset
data.niaid.nih.gov
zenodo.org
Updated Apr 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Refahi, Mohammad Saleh (2025). Scorpio Gene-Taxa Benchmark Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12175912
Explore at:
Dataset updated
Apr 3, 2025
Dataset provided by
Drexel University
Authors
Refahi, Mohammad Saleh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.

To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.

We created four datasets to evaluate our model: a training set (Train_set), a test set (Test_set) with different samples but the same genus and gene as the training set, a Taxa_out_set excluding 18 phyla present in the training set but from different phyla, and a Gene_out_set excluding 60 genes from the training set but from the same phyla. We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.
f
Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Dec 11, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng (2014). “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001209841
Explore at:
Dataset updated
Dec 11, 2014
Authors
Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng
Description
“Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education
s
BAGS.v1.1: BAltic Gene Set gene catalogue
figshare.scilifelab.se
demo.researchdata.se
+2more
bin
Updated May 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Luis Fernando Delgado Zambrano; Anders Andersson (2025). BAGS.v1.1: BAltic Gene Set gene catalogue [Dataset]. http://doi.org/10.17044/scilifelab.16677252.v3
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.17044/scilifelab.16677252.v3
Dataset updated
May 12, 2025
Dataset provided by
KTH Royal Institute of Technology
Authors
Luis Fernando Delgado Zambrano; Anders Andersson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The BAltic Gene Set gene catalogue v1.1 encompasses 66,530,673 genes.The 66 million genes are based on metagenomic data from Alneberg at al. (2020) from 124 seawater samples, that span the salinity and oxygen gradients of the Baltic Sea and capture seasonal dynamics at two locations. To obtain the gene catalogue, we used a mix-assembly approach described in Delgado et al. (2022).The gene catalogue has been functionally and taxonomically annotated, using the Mix-assembly Gene Catalog pipeline (https://github.com/EnvGen/mix_assembly_pipeline). The taxonomy annotation was performed using Mmseqs21 and CAT3.Here you find representative mix-assembly gene and protein sequences, and different types of annotations for the proteins. Also, contigs for the co-assembly are included (see Delgado et al. 2022), gene and protein sequences from each individual assembly and the co-assembly, and a table containing the genes in each of the clusters. See README for details.When using the BAGSv1.1 gene catalogue, please cite:1. Delgado LF, Andersson AF. Evaluating metagenomic assembly approaches for biome-specific gene catalogues. Microbiome 10, 72 (2022)2. Alneberg J, Bennke C, Beier S, Bunse C, Quince C, Ininbergs K, Riemann L, Ekman M, Jürgens K, Labrenz M, Pinhassi J, Andersson AF (2020) Ecosystem-wide metagenomic binning enables prediction of ecological niches from genomes. Commun Biol 3, 119 (2020)
f
DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus...
datasetcatalog.nlm.nih.gov
figshare.com
+1more
Updated Apr 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dai, Xinzhu; Pan, Zhixin; Jiang, Zhihang; Shao, Mengting; Liu, Dongmei (2022). DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus Erythematosus Based on Bioinformatics Analysis and Machine Learning.ZIP [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000226710
Explore at:
Dataset updated
Apr 14, 2022
Authors
Dai, Xinzhu; Pan, Zhixin; Jiang, Zhihang; Shao, Mengting; Liu, Dongmei
Description
Systemic lupus erythematosus (SLE) is a complex autoimmune disease that affects several organs and causes variable clinical symptoms. Exploring new insights on genetic factors may help reveal SLE etiology and improve the survival of SLE patients. The current study is designed to identify key genes involved in SLE and develop potential diagnostic biomarkers for SLE in clinical practice. Expression data of all genes of SLE and control samples in GSE65391 and GSE72509 datasets were downloaded from the Gene Expression Omnibus (GEO) database. A total of 11 accurate differentially expressed genes (DEGs) were identified by the “limma” and “RobustRankAggreg” R package. All these genes were functionally associated with several immune-related biological processes and a single KEGG (Kyoto Encyclopedia of Genes and Genome) pathway of necroptosis. The PPI analysis showed that IFI44, IFI44L, EIF2AK2, IFIT3, IFITM3, ZBP1, TRIM22, PRIC285, XAF1, and PARP9 could interact with each other. In addition, the expression patterns of these DEGs were found to be consistent in GSE39088. Moreover, Receiver operating characteristic (ROC) curves analysis indicated that all these DEGs could serve as potential diagnostic biomarkers according to the area under the ROC curve (AUC) values. Furthermore, we constructed the transcription factor (TF)-diagnostic biomarker-microRNA (miRNA) network composed of 278 nodes and 405 edges, and a drug-diagnostic biomarker network consisting of 218 nodes and 459 edges. To investigate the relationship between diagnostic biomarkers and the immune system, we evaluated the immune infiltration landscape of SLE and control samples from GSE6539. Finally, using a variety of machine learning methods, IFI44 was determined to be the optimal diagnostic biomarker of SLE and then verified by quantitative real-time PCR (qRT-PCR) in an independent cohort. Our findings may benefit the diagnosis of patients with SLE and guide in developing novel targeted therapy in treating SLE patients.
Sample DNA Sequence
kaggle.com
zip
Updated Jan 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sreshta Putchala (2021). Sample DNA Sequence [Dataset]. https://www.kaggle.com/sreshta140/covid19-genome-sequence
Explore at:
zip(69652 bytes)Available download formats
Dataset updated
Jan 14, 2021
Authors
Sreshta Putchala
Description
Dataset

This dataset was created by Sreshta Putchala

Contents
Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Dec 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
HIV Vaccine Trials Networkhttp://www.hvtn.org/
HIV Prevention Trials Networkhttp://www.hptn.org/
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
PEPFAR
Authors
Dylan Westfall; Mullins James
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.

Facebook

Twitter

Click to copy link

Link copied

Cite

Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated

Bioinformatics Protein Dataset - Simulated

Synthetic protein dataset with sequences, physical properties, and functional cl

Explore at:

zip(12928905 bytes)Available download formats

Dataset updated

Dec 27, 2024

Authors

Rafael Gallo

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.
Sequence: String of amino acids.
Molecular_Weight: Molecular weight calculated from the sequence.
Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
Hydrophobicity: Average hydrophobicity calculated from the sequence.
Total_Charge: Sum of the charges of the amino acids in the sequence.
Polar_Proportion: Percentage of polar amino acids in the sequence.
Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
Sequence_Length: Total number of amino acids in the sequence.
Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as: - UniProt: A comprehensive database of protein sequences and annotations. - Kyte-Doolittle Scale: Calculations of hydrophobicity. - Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for: - Training classification models for proteins. - Exploratory analysis of physicochemical properties of proteins. - Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
Property Calculation: Physicochemical properties were calculated using the Biopython library.
Class Assignment: Classes were randomly assigned for classification purposes.

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets: - Training: 16,000 samples (proteinas_train.csv). - Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

Clear search

Close search

Google apps

Main menu

Bioinformatics Protein Dataset - Simulated

Subtitle

Description

Introduction

Columns Included

Inspiration and Sources

Proposed Uses

How This Dataset Was Created

Limitations

Data Split

Acknowledgment

MINUTE-ChIP example data

Example dataset annotated with bacannot

DNA Classification dataset

INSDC Environment Sample Sequences

Data from: DNA metabarcoding captures subtle differences in forest beetle...

Examples datasets for Microbiology

Sample dataset for OnClass

Metagenomes example datasets

Example datasets for AlphaCRV

Bioinformatics repository examples with good practices of using GitHub.

DNA sequencing raw data and analytical results by bioinformatics for column...

RMQS1 16S bioinformatic config files and control sample data

Ensembl TSS dataset for GRCh38

Scorpio Gene-Taxa Benchmark Dataset

Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...

BAGS.v1.1: BAltic Gene Set gene catalogue

DataSheet1_Identification of Diagnostic Biomarkers in Systemic Lupus...

Sample DNA Sequence

Dataset

Contents

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

Bioinformatics Protein Dataset - SimulatedSee More Versions

Synthetic protein dataset with sequences, physical properties, and functional cl

Subtitle

Description

Introduction

Columns Included

Inspiration and Sources

Proposed Uses

How This Dataset Was Created

Limitations

Data Split

Acknowledgment

Bioinformatics Protein Dataset - Simulated