CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Subjective data models dataset
This dataset comprises data collected from study participants for a study into how people working with biological data perceive data, and whether this perception aligns with a person's experiential and educational background. We call the concept of what data looks like to an individual a "subjective data model".
Todo: link paper/preprint once published.
Computational Python analysis code: https://doi.org/10.5281/zenodo.7022789 and https://github.com/yochannah/subjective-data-models-analysis
Files
Transcripts of the recorded sessions are attached and have been verified by a second researcher. These files are all in plain text .txt format. Note that participant 3 did not agree to sharing the transcript of their interview.
Interview paper files
This folder has digital and photographed versions of the files shown to the participants for the file mapping task. Note that the original files are from the NCBI and from FlyBase.
Videos and stills from the recordings have been deleted in line with the Data Management Plan and Ethical Review.
anonymous_participant_list.csv
shows which files have transcripts associated (not all participants agreed to share transcripts), the order of Tasks A and B, the date of the interview, and what entities participants added to the set provided (if any). See the paper methods for more information about why entities were added to the set.
cards.txt
is a full list of the cards presented in the tasks.
background survey
and background manual annotations
are the selected survey data about participant background, plus manual additions to this data where necessary, e.g. to interpret free text.
codes.csv
shows the qualitative codes used within the transcripts.
entry_point.csv
is a record of participants' identified entry points into the data.
file_mapping_responses
shows a record of responses to the file mapping task.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
The dataset includes the following columns:
ID_Protein: a unique identifier for each protein.
Sequence: a string of amino acids.
Molecular_Weight: molecular weight calculated from the sequence.
Isoelectric_Point: estimated isoelectric point based on the sequence composition.
Hydrophobicity: average hydrophobicity calculated from the sequence.
Total_Charge: sum of the charges of the amino acids in the sequence.
Polar_Proportion: percentage of polar amino acids in the sequence.
Nonpolar_Proportion: percentage of nonpolar amino acids in the sequence.
Sequence_Length: total number of amino acids in the sequence.
Class: the functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets and tools such as UniProt (a comprehensive database of protein sequences and annotations), the Kyte-Doolittle scale (hydrophobicity calculations), and Biopython (a tool for analyzing biological sequences). This dataset is ideal for training classification models for proteins, exploratory analysis of physicochemical properties of proteins, and building machine learning pipelines in bioinformatics.
The dataset was created through sequence generation (amino acid chains randomly generated with lengths between 50 and 300 residues), property calculation using the Biopython library, and class assignment (classes randomly assigned for classification purposes). However, the sequences and properties do not represent real proteins but follow patterns observed in natural proteins, and the functional classes are simulated and do not correspond to actual biological characteristics.
The dataset is divided into two subsets: Training, which includes 16,000 samples (proteinas_train.csv), and Testing, which includes 4,000 samples (proteinas_test.csv). This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
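As a rough illustration only (this is a sketch, not the exact generator used to build the dataset), per-sequence properties of this kind can be computed with Biopython's ProteinAnalysis; the file and column names follow the description above, and the Total_Charge line shows one possible definition rather than the dataset's exact sum of residue charges:

import pandas as pd
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def physicochemical_properties(sequence):
    pa = ProteinAnalysis(sequence)
    return {
        "Molecular_Weight": pa.molecular_weight(),
        "Isoelectric_Point": pa.isoelectric_point(),
        "Hydrophobicity": pa.gravy(),          # Kyte-Doolittle average (GRAVY)
        "Total_Charge": pa.charge_at_pH(7.0),  # illustrative; the dataset sums residue charges
        "Sequence_Length": len(sequence),
    }

train = pd.read_csv("proteinas_train.csv")
print(physicochemical_properties(train.loc[0, "Sequence"]))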
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective treatments that use immunotherapy. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.
import pandas as pd

# Load the pre-split train/validation/test sets (pickled pandas DataFrames).
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
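As a minimal baseline sketch (not the repository's exact pipeline), the two embedding columns can be concatenated and fed to a linear classifier. This assumes each *_vector cell holds a fixed-length numeric array and reuses the DataFrames loaded above:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def to_features(df):
    # Concatenate the ESM-1b embeddings of the TCR and the epitope for each pair.
    return np.hstack([np.vstack(df["tcr_vector"].to_numpy()),
                      np.vstack(df["epitope_vector"].to_numpy())])

clf = LogisticRegression(max_iter=1000)
clf.fit(to_features(train_data), train_data["label"].astype(int))
val_probs = clf.predict_proba(to_features(validation_data))[:, 1]
print("Validation ROC AUC:", roc_auc_score(validation_data["label"].astype(int), val_probs))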
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJ database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem to protein-ligand binding prediction), the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
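The shuffle-based negative sampling described above can be sketched roughly as follows (illustrative toy pairs, not the authors' code): each TCR is re-paired with a randomly drawn epitope it is not known to bind.

import numpy as np
import pandas as pd

# Toy positive pairs; the real data are the VDJdb/ImmuneCODE-derived TCR-epitope pairs.
positives = pd.DataFrame({
    "tcr": ["CASSLAPGATNEKLFF", "CASSIRSSYEQYF", "CASSPGQGDNEQFF"],
    "epitope_aa": ["GILGFVFTL", "NLVPMVATV", "GLCTLVAML"],
})
known = set(zip(positives["tcr"], positives["epitope_aa"]))
rng = np.random.default_rng(0)

negatives = []
for tcr in positives["tcr"]:
    epitope = positives["epitope_aa"].iloc[rng.integers(len(positives))]
    while (tcr, epitope) in known:  # re-draw until the pair is not a known binder
        epitope = positives["epitope_aa"].iloc[rng.integers(len(positives))]
    negatives.append({"tcr": tcr, "epitope_aa": epitope, "label": 0})
negatives = pd.DataFrame(negatives)
print(negatives)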
Checkpoint name | Number of layers | Number of parameters |
esm2_t48_15B_UR50D | 48 | 15B |
esm2_t36_3B_UR50D | 36 | 3B |
esm2_t33_650M_UR50D | 33 | 650M |
esm2_t30_150M_UR50D | 30 | 150M |
esm2_t12_35M_UR50D | 12 | 35M |
esm2_t6_8M_UR50D | 6 | 8M |
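For reference, the *_vector columns were produced with ESM-1b. A rough sketch of generating a per-sequence embedding with the fair-esm package (following its README; the ESM-2 checkpoints in the table above can be swapped in) might look like this:

import torch
import esm

# ESM-1b, as used for the embeddings in this dataset; e.g. esm.pretrained.esm2_t33_650M_UR50D() would also work.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_epitope", "GILGFVFTL")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]
embedding = reps[0, 1:len(data[0][1]) + 1].mean(0)  # mean-pool residues, skipping BOS/EOS tokens
print(embedding.shape)                              # 1280-dimensional vector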
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grouping assumed in each model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is for the paper titled "DeepHelicon: accurate prediction of inter-helical residue contacts in transmembrane proteins by residual neural networks". It contains four sub-folders as follows:
1. Fasta: the protein sequences in the TRAIN, PREVIOUS, and TEST datasets, respectively.
2. PDB: the protein native structures in the TRAIN, PREVIOUS, and TEST datasets, respectively.
3. Predictions: the contact predictions on the PREVIOUS and TEST datasets, produced by the contact prediction methods mentioned in the DeepHelicon paper.
4. 3D modelling: the 3D models, guided by the secondary structures predicted by SCRATCH1D and by the residue contacts predicted by DeepHelicon and DeepMetaPSICOV, respectively, and generated with CONFOLD2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The median and the upper and lower quartiles of the adequacies of linear fitting versus bilinear fitting, bilinear fitting versus paraboloid fitting, and bilinear fitting versus cubic spline function fitting are given. The right-hand column shows the percentage of the families whose adequacies are above 0.9 (33/73, 71/73, 63/73, 22/75, 74/75, respectively).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We included the files used in the hallamlab/chap repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AlphaFold is used to regenerate the antibody structures in BM5 starting only from the sequence, without using any templates from the PDB. This dataset contains the top predicted models from these runs for each antibody, before any relaxation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ProtVec model trained using 425,000 sequences from the Genome Taxonomy Database (GTDB). Sequences were dereplicated at 70% identity using CD-HIT and filtered to remove sequences containing 'X', sequences shorter than 30 amino acids, and sequences longer than 1024 amino acids.
Training used a vector size of 100 and a context size of 25.
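A minimal sketch of ProtVec-style training with those settings, assuming a word2vec-style trainer such as gensim (illustrative only; overlapping 3-mers are used here, whereas the original ProtVec splits each sequence into three shifted non-overlapping 3-mer lists):

from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    # Represent a protein as a "sentence" of 3-mer "words".
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Placeholder corpus; the real model used ~425,000 dereplicated GTDB sequences.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"]
corpus = [to_kmers(s) for s in sequences]

# vector_size=100 and window=25 match the vector and context sizes stated above.
model = Word2Vec(corpus, vector_size=100, window=25, sg=1, min_count=1, workers=4)
model.save("protvec_gtdb.model")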
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Drug REpurposing using Mechanistic Models of signal transduction and Machine Learning (DREM³L)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the training, optimization, and testing sets used for developing the CATHe model, which is a deep learning framework capable of detecting extremely remote homologues (< 20% sequence identity) for CATH superfamilies. Additionally, the training weights for the artificial neural network present in the CATHe model have been provided.
Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...
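A highly simplified sketch of the modeling setup described above follows (toy sequences and clade labels, not the study's data; the degenerate k-mer encoding itself is not shown): nucleotide k-mer counts feed a regularized logistic regression, evaluated with a group-shuffle-split keyed on taxonomy so that related genomes stay on one side of the split.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline

# Toy placeholders; the study used complete coronavirus assemblies from before October 2018.
# Toy labels are mixed within each clade only so both classes appear in any training split.
genomes = ["ATGCGTACGTTAGCAA", "ATGCGTACGTAAGCAA", "TTGACCGGTTAACCGG",
           "TTGACCGGTTAGCCGG", "ATGCTTACGTTAGCAA", "TTGACAGGTTAACCGG"]
labels  = [1, 0, 1, 0, 1, 0]                                            # 1 = phenotype-of-concern
groups  = ["cladeA", "cladeA", "cladeB", "cladeB", "cladeC", "cladeC"]  # taxonomy-guided groups

def to_kmer_doc(seq, k=6):
    # Represent a genome as a whitespace-separated "document" of overlapping k-mers.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

docs = [to_kmer_doc(g) for g in genomes]
clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))

# "Group-shuffle-split": whole taxonomic groups are held out together.
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(docs, labels, groups))
clf.fit([docs[i] for i in train_idx], [labels[i] for i in train_idx])
print(clf.score([docs[i] for i in test_idx], [labels[i] for i in test_idx]))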
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All scripts for predictions and analysis are available from https://github.com/ElofssonLab/bioinfo-toolbox/trRosetta/. Details for each run are available from https://github.com/ElofssonLab/bioinfo-toolbox/benchmark5/benchmark4.3/. All models, joined alignments, and evaluation results are available from a figshare repository [44].
The data is organized as follows:
1) One directory (N*/ as well as ./) contains all the results and data for one set of parameters.
2) Each directory includes the following subdirectories:
2a) seq/ (all sequences)
2b) pdb/ (all original pdb files)
2c) dimer/ (all merged MSA files)
2d) pymodel/ (all models generated and the measurements, in csv files, to evaluate their performance)
3) The directory Figures/ contains all figures, the scripts to generate them, and a summary of all predictions in a csv file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whole-genome bisulfite sequencing (WGBS) data in bigBed format.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Environmental temperatures influence ectotherms' physiology and capacity to perform activities necessary for survival and reproduction. Time available to perform those activities is determined by thermal tolerances and environmental temperatures. Estimates of activity time might enhance our ability to predict suitable areas for species' persistence in the face of climate warming, compared to the exclusive use of environmental temperatures without considering thermal tolerances. We compare the ability of environmental temperatures and estimates of activity time to predict the geographic distribution of a tropical lizard, Tropidurus torquatus. We compared 105 estimates of activity time, resulting from the combination of four methodological decisions: (1) How to estimate daily environmental temperature variation (modeling a sinusoid wave ranging from the monthly minimum to maximum temperature, extrapolating from operative temperatures measured in the field, or using biophysical projections of microclimate)? (2) In which temperature range are animals considered active? (3) Should these ranges be determined from body temperatures obtained in the laboratory or in the field? and (4) Should thermoregulation simulations be included in the estimations? We show that models using estimates of activity time made with the sinusoid and biophysical methods had higher predictive accuracy than those using environmental temperatures alone. Estimates made using the central 90% of temperatures measured in a thermal gradient as the temperature range for activity also ranked higher than environmental temperatures. Thermoregulation simulations did not improve model accuracy. Precipitation ranked higher than thermally related predictors. Activity time adds important information to distribution modeling and should be considered as a predictor in studies of the distribution of ectotherms. The distribution of T. torquatus is restricted by precipitation and by the effect of lower temperatures on their time of activity, and climate warming could lead to range expansion. We provide an R package, "Mapinguari", with tools to generate spatial predictors based on the processes described herein.
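As a rough illustration of the sinusoid-based estimate described above (a sketch only; the activity thresholds and the afternoon peak hour are hypothetical placeholders, and the study's actual tooling is the R package Mapinguari):

import numpy as np

def daily_activity_hours(t_min, t_max, lower=24.0, upper=36.0):
    hours = np.arange(24)
    # Sinusoid oscillating between the monthly minimum and maximum, peaking mid-afternoon.
    temps = (t_max + t_min) / 2 + (t_max - t_min) / 2 * np.sin(2 * np.pi * (hours - 9) / 24)
    # Hours with temperatures inside the activity range count as activity time.
    return int(((temps >= lower) & (temps <= upper)).sum())

print(daily_activity_hours(t_min=18.0, t_max=34.0))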
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the two custom-trained Guppy models used in the study: custom-Kp and custom-Kp-big-net.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pretrained Gaussian embeddings for Set2Gaussian.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training data for the "Drug REpurposing using eXplainable Machine Learning and Mechanistic Models of signal transduction" (DRExM³L) package.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scorpions have evolved venoms with a diverse variety of toxins acting on a plethora of biological targets, but characterizing the evolution of this molecular diversity has been limited by the lack of a comprehensive phylogenetic hypothesis of scorpion relationships. Here, we elucidate the origin of calcins and lambda potassium channel toxins (LKTx), a type of inhibitor cystine knot (ICK) toxin. In this study, we compiled the most comprehensive sampling of ICK peptide homologs across the breadth of scorpion diversity. Using a de novo phylogenomic tree and morphometric analysis of 3D molecular models of these peptides, we traced the evolutionary change in the shape of these peptides. The discovery of phylogenetic signal in 3D molecular models of calcins and LKTx provides the first synapomorphies for the two extant scorpion parvorders, which are otherwise morphologically diverse and undefined. Our morphometric analyses further reveal that calcins and LKTx evolved different shapes with minimal overlap in morphospace, supporting their characterization as stable (non-convergent) synapomorphies.
Different from significant gene expression analysis, which looks for all genes that are differentially regulated, feature selection in prognostic gene expression analysis aims at finding a subset of informative marker genes that are discriminative for prediction. Unfortunately, feature selection in the microarray literature is dominated by the simple heuristic univariate gene-filter paradigm, which selects differentially expressed genes according to their statistical significance. Since the univariate approach does not take into account the correlated or interactive structure among the genes, classifiers built on genes so selected can be less accurate. More advanced approaches based on multivariate models have to be considered. Here, we introduce a feature ranking method through forward orthogonal search to assist prognostic gene selection. Application to published gene lists selected by univariate models shows that the feature space can be largely reduced while achieving improved testing performance. Our results indicate that "significant" features selected using gene-wise approaches can contain irrelevant genes that only serve to complicate model building. Multivariate feature ranking can help to reduce feature redundancy and to select highly informative prognostic marker genes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
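As an analogy only (scikit-learn's greedy forward selection, not the paper's forward orthogonal search), a multivariate forward-selection step over a synthetic stand-in for a gene expression matrix could look like this:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a gene expression matrix (samples x genes).
X, y = make_classification(n_samples=100, n_features=200, n_informative=10, random_state=0)

# Greedy forward selection evaluates candidate genes jointly with those already chosen,
# unlike univariate filtering, which ranks each gene in isolation.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=10, direction="forward", cv=5)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True).tolist())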
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.