CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Subjective data models dataset
This dataset comprises data collected from study participants for a study into how people working with biological data perceive data, and whether this perception aligns with a person's experiential and educational background. We call the concept of what data looks like to an individual a "subjective data model".
Todo: link paper/preprint once published.
Computational Python analysis code: https://doi.org/10.5281/zenodo.7022789 and https://github.com/yochannah/subjective-data-models-analysis
Files
Transcripts of the recorded sessions are attached and have been verified by a second researcher. These files are all in plain text .txt format. Note that participant 3 did not agree to sharing the transcript of their interview.
Interview paper files
This folder has digital and photographed versions of the files shown to the participants for the file mapping task. Note that the original files are from the NCBI and from FlyBase.
Videos and stills from the recordings have been deleted in line with the Data Management Plan and Ethical Review.
anonymous_participant_list.csv
shows which files have transcripts associated (not all participants agreed to share transcripts), the order of Tasks A and B, the date of the interview, and what entities participants added to the set provided (if any). See the paper methods for more information about why entities were added to the set.
cards.txt
is a full list of the cards presented in the tasks.
background survey
and background manual annotations
are the selected survey data about participant background, plus manual additions to this data where necessary, e.g. to interpret free text.
codes.csv
shows the qualitative codes used within the transcripts.
entry_point.csv
is a record of participants' identified entry points into the data.
file_mapping_responses
shows a record of responses to the file mapping task.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
The dataset includes the following columns:
ID_Protein: a unique identifier for each protein.
Sequence: a string of amino acids.
Molecular_Weight: molecular weight calculated from the sequence.
Isoelectric_Point: estimated isoelectric point based on the sequence composition.
Hydrophobicity: average hydrophobicity calculated from the sequence.
Total_Charge: sum of the charges of the amino acids in the sequence.
Polar_Proportion: percentage of polar amino acids in the sequence.
Nonpolar_Proportion: percentage of nonpolar amino acids in the sequence.
Sequence_Length: total number of amino acids in the sequence.
Class: the functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets and tools such as UniProt (a comprehensive database of protein sequences and annotations), the Kyte-Doolittle scale (hydrophobicity calculations), and Biopython (a tool for analyzing biological sequences). This dataset is ideal for training classification models for proteins, exploratory analysis of physicochemical properties of proteins, and building machine learning pipelines in bioinformatics.
The dataset was created through sequence generation (amino acid chains randomly generated with lengths between 50 and 300 residues), property calculation using the Biopython library, and class assignment (classes randomly assigned for classification purposes). However, the sequences and properties do not represent real proteins but follow patterns observed in natural proteins, and the functional classes are simulated and do not correspond to actual biological characteristics.
The dataset is divided into two subsets: Training, which includes 16,000 samples (proteinas_train.csv), and Testing, which includes 4,000 samples (proteinas_test.csv). This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
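As a rough illustration only (this is a sketch, not the exact generator used to build the dataset), per-sequence properties of this kind can be computed with Biopython's ProteinAnalysis; the file and column names follow the description above, and the Total_Charge line shows one possible definition rather than the dataset's exact sum of residue charges:

import pandas as pd
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def physicochemical_properties(sequence):
    pa = ProteinAnalysis(sequence)
    return {
        "Molecular_Weight": pa.molecular_weight(),
        "Isoelectric_Point": pa.isoelectric_point(),
        "Hydrophobicity": pa.gravy(),          # Kyte-Doolittle average (GRAVY)
        "Total_Charge": pa.charge_at_pH(7.0),  # illustrative; the dataset sums residue charges
        "Sequence_Length": len(sequence),
    }

train = pd.read_csv("proteinas_train.csv")
print(physicochemical_properties(train.loc[0, "Sequence"]))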
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective treatments that use immunotherapy. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.
import pandas as pd

# Load the pre-split train/validation/test sets (pickled pandas DataFrames).
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
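As a minimal baseline sketch (not the repository's exact pipeline), the two embedding columns can be concatenated and fed to a linear classifier. This assumes each *_vector cell holds a fixed-length numeric array and reuses the DataFrames loaded above:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def to_features(df):
    # Concatenate the ESM-1b embeddings of the TCR and the epitope for each pair.
    return np.hstack([np.vstack(df["tcr_vector"].to_numpy()),
                      np.vstack(df["epitope_vector"].to_numpy())])

clf = LogisticRegression(max_iter=1000)
clf.fit(to_features(train_data), train_data["label"].astype(int))
val_probs = clf.predict_proba(to_features(validation_data))[:, 1]
print("Validation ROC AUC:", roc_auc_score(validation_data["label"].astype(int), val_probs))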
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJ database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem to protein-ligand binding prediction), the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
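The shuffle-based negative sampling described above can be sketched roughly as follows (illustrative toy pairs, not the authors' code): each TCR is re-paired with a randomly drawn epitope it is not known to bind.

import numpy as np
import pandas as pd

# Toy positive pairs; the real data are the VDJdb/ImmuneCODE-derived TCR-epitope pairs.
positives = pd.DataFrame({
    "tcr": ["CASSLAPGATNEKLFF", "CASSIRSSYEQYF", "CASSPGQGDNEQFF"],
    "epitope_aa": ["GILGFVFTL", "NLVPMVATV", "GLCTLVAML"],
})
known = set(zip(positives["tcr"], positives["epitope_aa"]))
rng = np.random.default_rng(0)

negatives = []
for tcr in positives["tcr"]:
    epitope = positives["epitope_aa"].iloc[rng.integers(len(positives))]
    while (tcr, epitope) in known:  # re-draw until the pair is not a known binder
        epitope = positives["epitope_aa"].iloc[rng.integers(len(positives))]
    negatives.append({"tcr": tcr, "epitope_aa": epitope, "label": 0})
negatives = pd.DataFrame(negatives)
print(negatives)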
Checkpoint name | Number of layers | Number of parameters |
esm2_t48_15B_UR50D | 48 | 15B |
esm2_t36_3B_UR50D | 36 | 3B |
esm2_t33_650M_UR50D | 33 | 650M |
esm2_t30_150M_UR50D | 30 | 150M |
esm2_t12_35M_UR50D | 12 | 35M |
esm2_t6_8M_UR50D | 6 | 8M |
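For reference, the *_vector columns were produced with ESM-1b. A rough sketch of generating a per-sequence embedding with the fair-esm package (following its README; the ESM-2 checkpoints in the table above can be swapped in) might look like this:

import torch
import esm

# ESM-1b, as used for the embeddings in this dataset; e.g. esm.pretrained.esm2_t33_650M_UR50D() would also work.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_epitope", "GILGFVFTL")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]
embedding = reps[0, 1:len(data[0][1]) + 1].mean(0)  # mean-pool residues, skipping BOS/EOS tokens
print(embedding.shape)                              # 1280-dimensional vector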
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grouping assumed in each model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is for the paper titled "DeepHelicon: accurate prediction of inter-helical residue contacts in transmembrane proteins by residual neural networks". It contains four sub-folders as follows:
1. Fasta: the protein sequences in the TRAIN, PREVIOUS, and TEST datasets, respectively.
2. PDB: the protein native structures in the TRAIN, PREVIOUS, and TEST datasets, respectively.
3. Predictions: the contact predictions on the PREVIOUS and TEST datasets, produced by the contact prediction methods mentioned in the DeepHelicon paper.
4. 3D modelling: the 3D models, guided by the secondary structures predicted by SCRATCH1D and by the residue contacts predicted by DeepHelicon and DeepMetaPSICOV, respectively, and generated with CONFOLD2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The median and the upper and lower quartiles of the adequacies of linear fitting versus bilinear fitting, bilinear fitting versus paraboloid fitting, and bilinear fitting versus cubic spline function fitting are given. The right-hand column shows the percentage of the families whose adequacies are above 0.9 (33/73, 71/73, 63/73, 22/75, 74/75, respectively).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We included the files used in the hallamlab/chap repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AlphaFold is used to regenerate the antibody structures in BM5 starting only from the sequence, without using any templates from the PDB. This dataset contains the top predicted models from these runs for each antibody, before any relaxation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ProtVec model trained using 425,000 sequences from the Genome Taxonomy Database (GTDB). Sequences were dereplicated at 70% identity using CD-HIT and filtered to remove sequences containing 'X', sequences shorter than 30 amino acids, and sequences longer than 1024 amino acids.
Training used a vector size of 100 and a context size of 25.
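A minimal sketch of ProtVec-style training with those settings, assuming a word2vec-style trainer such as gensim (illustrative only; overlapping 3-mers are used here, whereas the original ProtVec splits each sequence into three shifted non-overlapping 3-mer lists):

from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    # Represent a protein as a "sentence" of 3-mer "words".
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Placeholder corpus; the real model used ~425,000 dereplicated GTDB sequences.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"]
corpus = [to_kmers(s) for s in sequences]

# vector_size=100 and window=25 match the vector and context sizes stated above.
model = Word2Vec(corpus, vector_size=100, window=25, sg=1, min_count=1, workers=4)
model.save("protvec_gtdb.model")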
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Drug REpurposing using Mechanistic Models of signal transduction and Machine Learning (DREM³L)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the training, optimization, and testing sets used for developing the CATHe model, which is a deep learning framework capable of detecting extremely remote homologues (< 20% sequence identity) for CATH superfamilies. Additionally, the training weights for the artificial neural network present in the CATHe model have been provided.
Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone is challenging with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. We developed a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks. By leveraging a taxonomy-guided ‘group-shuffle-split’ cross validation paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level. We demonstrate the feasibility of this method by finding models accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (...
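A highly simplified sketch of the modeling setup described above follows (toy sequences and clade labels, not the study's data; the degenerate k-mer encoding itself is not shown): nucleotide k-mer counts feed a regularized logistic regression, evaluated with a group-shuffle-split keyed on taxonomy so that related genomes stay on one side of the split.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline

# Toy placeholders; the study used complete coronavirus assemblies from before October 2018.
# Toy labels are mixed within each clade only so both classes appear in any training split.
genomes = ["ATGCGTACGTTAGCAA", "ATGCGTACGTAAGCAA", "TTGACCGGTTAACCGG",
           "TTGACCGGTTAGCCGG", "ATGCTTACGTTAGCAA", "TTGACAGGTTAACCGG"]
labels  = [1, 0, 1, 0, 1, 0]                                            # 1 = phenotype-of-concern
groups  = ["cladeA", "cladeA", "cladeB", "cladeB", "cladeC", "cladeC"]  # taxonomy-guided groups

def to_kmer_doc(seq, k=6):
    # Represent a genome as a whitespace-separated "document" of overlapping k-mers.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

docs = [to_kmer_doc(g) for g in genomes]
clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))

# "Group-shuffle-split": whole taxonomic groups are held out together.
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(docs, labels, groups))
clf.fit([docs[i] for i in train_idx], [labels[i] for i in train_idx])
print(clf.score([docs[i] for i in test_idx], [labels[i] for i in test_idx]))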
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All scripts for predictions and analysis are available from https://github.com/ElofssonLab/bioinfo-toolbox/trRosetta/. Details for each run are available from https://github.com/ElofssonLab/bioinfo-toolbox/benchmark5/benchmark4.3/. All models, joined alignments, and evaluation results are available from a figshare repository [44].
The data is organized as follows:
1) One directory (N*/ as well as ./) contains all the results and data for one set of parameters.
2) Each directory includes the following subdirectories:
2a) seq/ (all sequences)
2b) pdb/ (all original pdb files)
2c) dimer/ (all merged MSA files)
2d) pymodel/ (all models generated and the measurements, in csv files, to evaluate their performance)
3) The directory Figures/ contains all figures, the scripts to generate them, and a summary of all predictions in a csv file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whole-genome bisulfite sequencing (WGBS) data in bigBed format.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Environmental temperatures influence ectotherms' physiology and capacity to perform activities necessary for survival and reproduction. Time available to perform those activities is determined by thermal tolerances and environmental temperatures. Estimates of activity time might enhance our ability to predict suitable areas for species' persistence in the face of climate warming, compared to the exclusive use of environmental temperatures without considering thermal tolerances. We compare the ability of environmental temperatures and estimates of activity time to predict the geographic distribution of a tropical lizard, Tropidurus torquatus. We compared 105 estimates of activity time, resulting from the combination of four methodological decisions: (1) How to estimate daily environmental temperature variation (modeling a sinusoid wave ranging from the monthly minimum to maximum temperature, extrapolating from operative temperatures measured in the field, or using biophysical projections of microclimate)? (2) In which temperature range are animals considered active? (3) Should these ranges be determined from body temperatures obtained in the laboratory or in the field? and (4) Should thermoregulation simulations be included in the estimations? We show that models using estimates of activity time made with the sinusoid and biophysical methods had higher predictive accuracy than those using environmental temperatures alone. Estimates made using the central 90% of temperatures measured in a thermal gradient as the temperature range for activity also ranked higher than environmental temperatures. Thermoregulation simulations did not improve model accuracy. Precipitation ranked higher than thermally related predictors. Activity time adds important information to distribution modeling and should be considered as a predictor in studies of the distribution of ectotherms. The distribution of T. torquatus is restricted by precipitation and by the effect of lower temperatures on their time of activity, and climate warming could lead to range expansion. We provide an R package, "Mapinguari", with tools to generate spatial predictors based on the processes described herein.
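As a rough illustration of the sinusoid-based estimate described above (a sketch only; the activity thresholds and the afternoon peak hour are hypothetical placeholders, and the study's actual tooling is the R package Mapinguari):

import numpy as np

def daily_activity_hours(t_min, t_max, lower=24.0, upper=36.0):
    hours = np.arange(24)
    # Sinusoid oscillating between the monthly minimum and maximum, peaking mid-afternoon.
    temps = (t_max + t_min) / 2 + (t_max - t_min) / 2 * np.sin(2 * np.pi * (hours - 9) / 24)
    # Hours with temperatures inside the activity range count as activity time.
    return int(((temps >= lower) & (temps <= upper)).sum())

print(daily_activity_hours(t_min=18.0, t_max=34.0))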
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the two custom-trained Guppy models used in the study: custom-Kp and custom-Kp-big-net.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pretrained Gaussian embeddings for Set2Gaussian.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training data for the "Drug REpurposing using eXplainable Machine Learning and Mechanistic Models of signal transduction" (DRExM³L) package.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scorpions have evolved venoms with a diverse variety of toxins acting on a plethora of biological targets, but characterizing the evolution of this molecular diversity has been limited by the lack of a comprehensive phylogenetic hypothesis of scorpion relationships. Here, we elucidate the origin of calcins and lambda potassium channel toxins (LKTx), a type of inhibitor cystine knot (ICK) toxin. In this study, we compiled the most comprehensive sampling of ICK peptide homologs across the breadth of scorpion diversity. Using a de novo phylogenomic tree and morphometric analysis of 3D molecular models of these peptides, we traced the evolutionary change in the shape of these peptides. The discovery of phylogenetic signal in 3D molecular models of calcins and LKTx provides the first synapomorphies for the two extant scorpion parvorders, which are otherwise morphologically diverse and undefined. Our morphometric analyses further reveal that calcins and LKTx evolved different shapes with minimal overlap in morphospace, supporting their characterization as stable (non-convergent) synapomorphies.
Different from significant gene expression analysis, which looks for all genes that are differentially regulated, feature selection in prognostic gene expression analysis aims at finding a subset of informative marker genes that are discriminative for prediction. Unfortunately, feature selection in the microarray literature is dominated by the simple heuristic univariate gene-filter paradigm, which selects differentially expressed genes according to their statistical significance. Since the univariate approach does not take into account the correlated or interactive structure among the genes, classifiers built on genes so selected can be less accurate. More advanced approaches based on multivariate models have to be considered. Here, we introduce a feature ranking method through forward orthogonal search to assist prognostic gene selection. Application to published gene lists selected by univariate models shows that the feature space can be largely reduced while achieving improved testing performance. Our results indicate that "significant" features selected using gene-wise approaches can contain irrelevant genes that only serve to complicate model building. Multivariate feature ranking can help to reduce feature redundancy and to select highly informative prognostic marker genes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
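As an analogy only (scikit-learn's greedy forward selection, not the paper's forward orthogonal search), a multivariate forward-selection step over a synthetic stand-in for a gene expression matrix could look like this:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a gene expression matrix (samples x genes).
X, y = make_classification(n_samples=100, n_features=200, n_informative=10, random_state=0)

# Greedy forward selection evaluates candidate genes jointly with those already chosen,
# unlike univariate filtering, which ranks each gene in isolation.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=10, direction="forward", cv=5)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True).tolist())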
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.