10 datasets found

NR Viral TrEMBL
figshare.com
bz2
Updated Feb 9, 2018
Cite
Feargal Ryan (2018). NR Viral TrEMBL [Dataset]. http://doi.org/10.6084/m9.figshare.5822166.v1
Available download formats: bz2
Unique identifier
https://doi.org/10.6084/m9.figshare.5822166.v1
Dataset updated
Feb 9, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Feargal Ryan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The viral subset of the TrEMBL database clustered at 95% identity at the amino acid level to remove redundancy.
feature-representation-for-LLMs
figshare.com
xlsx
Updated Mar 28, 2024
Cite
wang rui; yujuan zhang; Zeyu Luo (2024). feature-representation-for-LLMs [Dataset]. http://doi.org/10.6084/m9.figshare.24312292.v6
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.24312292.v6
Dataset updated
Mar 28, 2024
Dataset provided by
figshare
Authors
wang rui; yujuan zhang; Zeyu Luo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This is a database of ESM2 feature representations. It includes Swiss data, Swiss normalized data, original TrEMBL data, original TrEMBL normalized data, non-homology TrEMBL data and Table S10. Non-homologous TrEMBL normalized data can be created by extracting the Entry IDs from the non-homologous TrEMBL data and then extracting the corresponding feature representations from the original TrEMBL normalized data. Figure S4 (eos) and Figure S5 (eos) supplement the histogram and scatter plots of the eos feature in the corresponding Figure S4 and Figure S5. Figure S6 and Figure S8 are the results of GO annotation enrichment. The GO gene set is a grouped protein dataset used for GO annotation enrichment. Figure S7 is a silhouette score plot. For specific usage of the dataset, please refer to GitHub. The RF_model files are pickle files for different RF models, which can be used for dataset inference and interpretability analysis. Among these models, the AA_count model and the feature_all model have more complex feature inputs, so we provide the Swiss training dataset as a reference for feature arrangement. The feature order for the other models is simply 0 to 1279.
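The description says that the non-homologous TrEMBL normalized data can be derived by looking up Entry IDs in the original TrEMBL normalized data. A minimal sketch of that lookup is given below. It assumes the two tables have already been loaded into memory, and the function name, the toy Entry IDs and the two-value vectors are hypothetical and not taken from the dataset files.

```python
from typing import Dict, Iterable, List


def select_normalized_features(
    non_homologous_ids: Iterable[str],
    normalized_features: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Keep the normalized vectors whose Entry ID is in the non-homologous set."""
    wanted = set(non_homologous_ids)
    return {
        entry_id: vector
        for entry_id, vector in normalized_features.items()
        if entry_id in wanted
    }


# Toy stand-ins (hypothetical IDs and values) for the two TrEMBL tables.
normalized = {
    "A0A000": [0.1, 0.2],
    "A0A001": [0.3, 0.4],
    "A0A002": [0.5, 0.6],
}
non_homologous = ["A0A002", "A0A000", "Q99999"]

subset = select_normalized_features(non_homologous, normalized)
print(sorted(subset))
```

IDs that are missing from the normalized table (here `Q99999`) are silently skipped. A stricter variant could raise an error instead.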
ExPASy Biochemical Pathways
neuinfo.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). ExPASy Biochemical Pathways [Dataset]. http://identifiers.org/RRID:SCR_007944
Unique identifier
https://identifiers.org/RRID:SCR_007944
Dataset updated
Jan 29, 2022
Description
The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE. It is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include format and content enhancements, cross-references to additional databases, new documentation files and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT.
TemStaPro Datasets
zenodo.org
application/gzip, tsv
Updated Apr 9, 2024
Cite
Ieva Pudžiuvelytė; Kliment Olechnovič; Egle Godliauskaite; Kristupas Sermokas; Tomas Urbaitis; Giedrius Gasiunas; Darius Kazlauskas (2024). TemStaPro Datasets [Dataset]. http://doi.org/10.5281/zenodo.10463156
Available download formats: application/gzip, tsv
Unique identifier
https://doi.org/10.5281/zenodo.10463156
Dataset updated
Apr 9, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ieva Pudžiuvelytė; Kliment Olechnovič; Egle Godliauskaite; Kristupas Sermokas; Tomas Urbaitis; Giedrius Gasiunas; Darius Kazlauskas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 14, 2024
Description
This dataset contains the protein sequences used to train, validate, and test the binary classifiers that form the TemStaPro program, which predicts protein thermostability with respect to nine temperature thresholds from 40 to 80 degrees Celsius in steps of five degrees.

The data are provided as FASTA files. Each protein sequence has a header made of three values separated by vertical bar symbols: the UniParc taxonomy identifier of the organism to which the protein belongs; the UniProtKB/TrEMBL identifier of the protein sequence; and the organism's growth temperature, taken from the dataset of growth temperatures of over 21 thousand organisms (Engqvist, 2018).

The TemStaPro-Major-30 set is composed of 12 files:

one training

one validation

one imbalanced testing

nine balanced samples of 2000 sequences from each of the balanced testing sets

The TemStaPro-Minor-30 set is composed of cross-validation and testing files, all balanced for the 65 degrees Celsius temperature threshold.

The SupplementaryFileC2EPsPredictions.tsv file contains thermostability predictions made with the default mode of the TemStaPro program to check the thermostability of different C2EP groups.

The detailed description is given in the revised version of the corresponding paper (https://doi.org/10.1093/bioinformatics/btae157).

If you use the data from this dataset, please cite both the paper and the DOI of the dataset.
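
The three-field FASTA header and the nine thresholds described above can be illustrated with a short sketch. The example header string, the function names, and the rule that a sequence counts as positive when its growth temperature is at or above a threshold are all assumptions made for illustration; the paper defines the actual labelling.

```python
from typing import Dict, List, NamedTuple


class HeaderFields(NamedTuple):
    taxonomy_id: str
    protein_id: str
    growth_temperature: float


# Nine thresholds from 40 to 80 degrees Celsius in steps of five.
THRESHOLDS: List[int] = list(range(40, 85, 5))


def parse_header(header: str) -> HeaderFields:
    """Split a '>taxonomy|protein|temperature' header into its three values."""
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(fields)}")
    taxonomy_id, protein_id, temperature = fields
    return HeaderFields(taxonomy_id, protein_id, float(temperature))


def binary_labels(growth_temperature: float) -> Dict[int, int]:
    """Assumed rule: label 1 when the temperature is >= the threshold."""
    return {t: int(growth_temperature >= t) for t in THRESHOLDS}


# Hypothetical header, not copied from the dataset.
example = parse_header(">12345|A0A000XYZ1|65")
labels = binary_labels(example.growth_temperature)
print(example.protein_id, labels)
```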
Diamond annotation
figshare.com
txt
Updated Oct 30, 2020
Cite
celine noirot (2020). Diamond annotation [Dataset]. http://doi.org/10.6084/m9.figshare.12320735.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12320735.v1
Dataset updated
Oct 30, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
celine noirot
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Annotation list for the Calanus finmarchicus reference transcriptome using DIAMOND [3].

Contigs were aligned with DIAMOND [3] against NR (2019-09-29), Swiss-Prot and TrEMBL (2018-12) to retrieve the corresponding best annotations.

An annotation matrix was then generated by selecting the best hit for each database if: i) the percent of the query length covered by the alignment was higher than 60%; ii) the percent of the subject length covered by the alignment was higher than 40%; iii) the percent of identity of the alignment was higher than 40%.

File diamond_annotation_206k.tsv is the annotation list for the Calanus finmarchicus reference transcriptome using DIAMOND [3] (36,274 contigs with an annotation in at least one database out of the 206,012).

File diamond_annotation_76k.tsv is the annotation list for the 76,550 contigs expressed with more than 1 CPM in the RNA sequencing Bioproject PRJNA628886 using DIAMOND [3] (22,527 contigs with an annotation in at least one database out of the 76,550).

[3] DIAMOND: version v0.9.22, parameters: -f 6 qseqid qlen qcovhsp pident score evalue length sseqid slen stitle. Ref: Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

Related to bioproject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA628886
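
The three selection criteria in this description can be sketched against DIAMOND tabular output that uses the column order listed in the parameters. This is an illustrative reconstruction, not the authors' script. Two points are assumptions: the best hit is taken to be the one with the highest `score`, and subject coverage is approximated as alignment `length` divided by `slen`. The sample rows are invented.

```python
import csv
import io
from typing import Dict

COLUMNS = ["qseqid", "qlen", "qcovhsp", "pident", "score",
           "evalue", "length", "sseqid", "slen", "stitle"]


def passes_filters(row: Dict[str, str]) -> bool:
    """Apply the three thresholds: query cov > 60, subject cov > 40, identity > 40."""
    query_cov = float(row["qcovhsp"])
    # Assumption: subject coverage approximated from alignment length.
    subject_cov = 100.0 * int(row["length"]) / int(row["slen"])
    identity = float(row["pident"])
    return query_cov > 60 and subject_cov > 40 and identity > 40


def best_annotations(tsv_text: str) -> Dict[str, Dict[str, str]]:
    """Pick the top-scoring hit per contig, then keep it only if it passes."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    best: Dict[str, Dict[str, str]] = {}
    for row in reader:
        current = best.get(row["qseqid"])
        # Assumption: "best hit" means the highest score.
        if current is None or float(row["score"]) > float(current["score"]):
            best[row["qseqid"]] = row
    return {q: r for q, r in best.items() if passes_filters(r)}


# Invented rows for one database search.
rows = [
    ["c1", "900", "80.0", "55.0", "300", "1e-50", "290", "sp|P1", "400", "protein one"],
    ["c1", "900", "30.0", "90.0", "100", "1e-10", "100", "sp|P2", "400", "protein two"],
    ["c2", "600", "50.0", "70.0", "200", "1e-30", "150", "sp|P3", "300", "protein three"],
]
sample = "\n".join("\t".join(r) for r in rows) + "\n"
annotations = best_annotations(sample)
print({q: r["sseqid"] for q, r in annotations.items()})
```

Running the same function once per database (NR, Swiss-Prot, TrEMBL) and joining the results by contig would give one column per database, in the spirit of the annotation matrix described.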
DataSheet1_A Baseline Evaluation of Bioinformatics Capacity in Tanzania...
frontiersin.figshare.com
pdf
Updated May 31, 2023
Cite
Raphael Zozimus Sangeda; Aneth David Mwakilili; Upendo Masamu; Siana Nkya; Liberata Alexander Mwita; Deogracious Protas Massawe; Sylvester Leonard Lyantagaye; Julie Makani (2023). DataSheet1_A Baseline Evaluation of Bioinformatics Capacity in Tanzania Reveals Areas for Training.pdf [Dataset]. http://doi.org/10.3389/feduc.2021.665313.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/feduc.2021.665313.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Raphael Zozimus Sangeda; Aneth David Mwakilili; Upendo Masamu; Siana Nkya; Liberata Alexander Mwita; Deogracious Protas Massawe; Sylvester Leonard Lyantagaye; Julie Makani
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Tanzania
Description
Due to insufficient human and infrastructure capacity to use novel genomics and bioinformatics technologies, Sub-Saharan African countries have not entirely reaped the benefits of these technologies in health and other sectors. The main objective of this study was to map out the interest and capacity for conducting bioinformatics and related research in Tanzania. The survey collected demographic information such as age group, experience, seniority level, gender, number of respondents per institution, number of publications, and willingness to join the community of practice. The survey also investigated the capacity of individuals and institutions regarding computing infrastructure, operating system use, statistical packages in use, basic Microsoft packages experience, programming language experience, bioinformatics tools and resources usage, and type of analyses performed. Moreover, respondents were surveyed about the challenges they faced in implementing bioinformatics and their willingness to join the bioinformatics community of practice in Tanzania. Out of 84 respondents, 50 (59.5%) were males. More than half, 44 (52.4%), were between 26 and 32 years old. The majority, 41 (48.8%), were master's degree holders with at least one publication related to bioinformatics. Eighty (95.2%) were willing to join the bioinformatics network and initiative in Tanzania. The major challenge, faced by 22 (26.2%) respondents, was the lack of training and skills. The most used resources for bioinformatics analyses were BLAST, PubMed, and GenBank. The analyses most often performed were sequence alignment and phylogenetics, reported by 57 (67.9%) and 42 (50%) of the respondents, respectively. The most frequently used statistical software packages were SPSS and R. A quarter of the respondents were conversant with computer programming.
Early career and young scientists were the largest groups of responders engaged in bioinformatics research and activities across surveyed institutions in Tanzania. The use of bioinformatics tools for analysis is still low, including basic analysis tools such as BLAST, GenBank, sequence alignment software, Swiss-Prot and TrEMBL. There is also poor access to resources and tools for bioinformatics analyses. To address the skills and resources gaps, we recommend various modes of training and capacity building of relevant bioinformatics skills and infrastructure to improve bioinformatics capacity in Tanzania.
Structural models of Cheiracanthium punctorium spider toxins with putative...
search.dataone.org
datadryad.org
Updated Oct 7, 2025
Cite
Tim Lüddecke; Josephine Dresler; Sabine Hurka; Volker Herzig (2025). Structural models of Cheiracanthium punctorium spider toxins with putative defensive function (CSTX-type and phospholipase A2) [Dataset]. http://doi.org/10.5061/dryad.fn2z34v7t
Unique identifier
https://doi.org/10.5061/dryad.fn2z34v7t
Dataset updated
Oct 7, 2025
Dataset provided by
Dryad Digital Repository
Authors
Tim Lüddecke; Josephine Dresler; Sabine Hurka; Volker Herzig
Description
Spider venom is an important but evolutionarily understudied functional trait. In this study, we have sequenced multiple venom gland transcriptomes from various spiders and analyzed their venom profile. The transcriptomes were generated via the Illumina TruSeq RNA Sample Prep Kit v2 or TruSeq Stranded mRNA Library Prep Kit (paired-end, 151-bp read length), sequenced by Macrogen (Korea) using different Illumina chemistries and assembled de novo using a pipeline incorporating Trinity v2.13.2/2.15.1 as well as rnaSPAdes v3.15.5. Identified toxin precursors were annotated using InterProScan v5.61-93.0/5.69-101.0, and a DIAMOND v2.0.15/2.1.9 blastp search was performed against the publicly available databases VenomZone, UniProtKB/Swiss-Prot Tox-Prot, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL v2022_05/2024_04. Our analysis revealed the presence of novel double-domain toxins from the CSTX family and some phospholipase A2 toxins sequence-wise resembling homologs from honeybees and scorpions, in...

RNA extraction and sequencing were outsourced to Macrogen. Following RNA extraction, libraries were constructed using the Illumina TruSeq RNA Sample Prep Kit v2 or TruSeq Stranded mRNA Library Prep Kit (paired-end, 151-bp read length). Quality was controlled by the verification of PCR-enriched fragment sizes using an Agilent Technologies 2100 Bioanalyzer with a DNA 1000 chip. The library quantity was determined by qPCR using the rapid library standard quantification solution and calculator (Roche). Transcriptome data were processed using a modified version of our in-house assembly and annotation pipeline62. All input sequences were inspected using FastQC v0.11.9/0.12.1 (www.bioinformatics.babraham.ac.uk) before trimming with cutadapt v4.2/4.9. The trimmed reads were corrected using Rcorrector v1.0.5/1.0.779 and assembled de novo using a pipeline incorporating Trinity v2.13.2/2.15.1 with a minimum contig size of 30 bp and maximum read normalization of 50 and rnaSPAdes v3.15.5 with and wi...

# Structural models of Cheiracanthium punctorium spider toxins with putative defensive function (CSTX-type and phospholipase A2)

Dataset DOI: 10.5061/dryad.fn2z34v7t

Description of the data and file structure

In order to grasp the architecture of double-domain CSTX toxins and to understand whether the similarity suggested by sequence for the phospholipase A2 toxins from the nurse's thorn finger (Cheiracanthium punctorium) is reflected in similarities in folding, structure and, hence, potentially function, we employed structural modelling of the sequenced transcripts. We modelled selected CSTX toxins (CPTX1a, CPTX 2d, CPTX 5a and CPTX5d) as well as identified phospholipase A2 toxins from C. punctorium (CPTX14a) and their most similar homologs from bees and scorpions (P00630 and Q6T178). The models (used to generate figures 3 and 5 of the linked publication/preprint) were created with AlphaFold 3 from sequence data under default parameters.

Files a...,
MebiPred predictions for UniProt - Jul 2023
figshare.com
application/gzip
Updated Sep 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
R Prabakaran (2023). MebiPred predictions for UniProt - Jul 2023 [Dataset]. http://doi.org/10.6084/m9.figshare.24101013.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.24101013.v1
Dataset updated
Sep 7, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
R Prabakaran
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The files contain the MeBiPred predictions for all the proteins (>248M) listed in UniProtKB (Swiss-Prot and TrEMBL). MeBiPred can be used to screen for proteins that bind metals (Na, K, Mg, Ca, Mn, Fe, Co, Ni, Cu, Zn).

Related links:

Publication: https://doi.org/10.1093/bioinformatics/btac358

Web App: https://services.bromberglab.org/mebipred/home

Python Package: https://pypi.org/project/mymetal/
LegumeIP
dknet.org
scicrunch.org
Updated Oct 13, 2025
Cite
(2025). LegumeIP [Dataset]. http://identifiers.org/RRID:SCR_008906/resolver/mentions?q=&i=rrid
Unique identifier
https://identifiers.org/RRID:SCR_008906 https://identifiers.org/RRID:SCR_008906/resolver/mentions?q=&i=rrid
Dataset updated
Oct 13, 2025
Description
LegumeIP is an integrative database and bioinformatics platform for comparative genomics and transcriptomics to facilitate the study of gene function and genome evolution in legumes, and ultimately to generate molecular-based breeding tools to improve the quality of crop legumes. LegumeIP currently hosts large-scale genomics and transcriptomics data, including:

* Genomic sequences of three model legumes, i.e. Medicago truncatula, Glycine max (soybean) and Lotus japonicus, together with two reference plant species, Arabidopsis thaliana and Populus trichocarpa, with annotation based on the UniProt TrEMBL, InterProScan, Gene Ontology and KEGG databases. LegumeIP covers a total of 222,217 protein-coding gene sequences.
* Large-scale gene expression data compiled from 104 array hybridizations from L. japonicus, 156 array hybridizations from the M. truncatula gene atlas database, and 14 RNA-Seq-based gene expression profiles from G. max on different tissues, including four common tissues: Nodule, Flower, Root and Leaf.
* Systematic synteny analysis among M. truncatula, G. max, L. japonicus and A. thaliana.
* Reconstruction of gene families and gene family-wide phylogenetic analysis across the five hosted species.

LegumeIP features comprehensive search and visualization tools to enable flexible queries on gene annotation, gene family, synteny and relative abundance of gene expression.
De Novo Characterization of the Mung Bean Transcriptome and Transcriptomic...
plos.figshare.com
tiff
Updated Jun 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shi-Weng Li; Rui-Fang Shi; Yan Leng (2023). De Novo Characterization of the Mung Bean Transcriptome and Transcriptomic Analysis of Adventitious Rooting in Seedlings Using RNA-Seq [Dataset]. http://doi.org/10.1371/journal.pone.0132969
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0132969
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Shi-Weng Li; Rui-Fang Shi; Yan Leng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Adventitious rooting is the most important mechanism underlying vegetative propagation and an important strategy for plant propagation under environmental stress. The present study was conducted to obtain transcriptomic data and examine gene expression using RNA-Seq and bioinformatics analysis, thereby providing a foundation for understanding the molecular mechanisms controlling adventitious rooting. Three cDNA libraries constructed from mRNA samples from mung bean hypocotyls during adventitious rooting were sequenced. These three samples generated a total of 73 million, 60 million, and 59 million 100-bp reads, respectively. These reads were assembled into 78,697 unigenes with an average length of 832 bp, totaling 65 Mb. The unigenes were aligned against six public protein databases, and 29,029 unigenes (36.77%) were annotated using BLASTx. Among them, 28,225 (35.75%) and 28,119 (35.62%) unigenes had homologs in the TrEMBL and NCBI non-redundant (Nr) databases, respectively. Of these unigenes, 21,140 were assigned to gene ontology classes, and a total of 11,990 unigenes were classified into 25 KOG functional categories. A total of 7,357 unigenes were annotated to 4,524 KOs, and 4,651 unigenes were mapped onto 342 KEGG pathways using BLAST comparison against the KEGG database. A total of 11,717 unigenes were differentially expressed (fold change>2) during the root induction stage, with 8,772 unigenes down-regulated and 2,945 unigenes up-regulated. A total of 12,737 unigenes were differentially expressed during the root initiation stage, with 9,303 unigenes down-regulated and 3,434 unigenes up-regulated. A total of 5,334 unigenes were differentially expressed between the root induction and initiation stage, with 2,167 unigenes down-regulated and 3,167 unigenes up-regulated. qRT-PCR validation of the 39 genes with known functions indicated a strong correlation (92.3%) with the RNA-Seq data. 
The GO enrichment, pathway mapping, and gene expression profiles reveal molecular traits for root induction and initiation. This study provides a platform for functional genomic research with this species.