10 datasets found
  1. NR Viral TrEMBL

    • figshare.com
    bz2
    Updated Feb 9, 2018
    Cite
    Feargal Ryan (2018). NR Viral TrEMBL [Dataset]. http://doi.org/10.6084/m9.figshare.5822166.v1
    Available download formats: bz2
    Dataset updated: Feb 9, 2018
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Feargal Ryan
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The viral subset of the TrEMBL database clustered at 95% identity at the amino acid level to remove redundancy.

  2. feature-representation-for-LLMs

    • figshare.com
    xlsx
    Updated Mar 28, 2024
    Cite
    wang rui; yujuan zhang; Zeyu Luo (2024). feature-representation-for-LLMs [Dataset]. http://doi.org/10.6084/m9.figshare.24312292.v6
    Available download formats: xlsx
    Dataset updated: Mar 28, 2024
    Dataset provided by: figshare
    Authors: wang rui; yujuan zhang; Zeyu Luo
    License: MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a database of ESM2 feature representations, which includes Swiss data, Swiss normalized data, original TrEMBL data, original TrEMBL normalized data, non-homology TrEMBL data and Table S10. Non-homologous TrEMBL normalized data can be created by extracting the Entry ID from the non-homologous TrEMBL data and then extracting the corresponding feature representation from the original TrEMBL normalized data. Figure S4 (eos) and Figure S5 (eos) supplement the histogram plots and scatter plots of feature eos in the corresponding Figure S4 and Figure S5. Figure S6 and Figure S8 are the results of GO annotation enrichment. The GO gene set is a grouped protein dataset used for GO annotation enrichment. Figure S7 is a silhouette score plot. For specific usage of the dataset, please refer to GitHub. The RF_model files are pickle files for different RF models, which can be used for dataset inference and interpretable analysis. Among these models, the AA_count model and feature_all model have more complex feature inputs. Therefore, we provide the Swiss training dataset as a reference for feature arrangement. The feature order for other models is simply from 0 to 1279.
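    The Entry ID lookup described above can be sketched roughly as follows. This is only an illustration: the column name "Entry ID", the function name, and the toy rows are assumptions, not taken from the dataset files, and the feature vectors are shortened for readability.

```python
# Sketch: keep only the normalized rows whose Entry ID also appears
# in the non-homologous table. All names and values are hypothetical.

def select_by_entry_id(non_homologous_rows, normalized_rows, id_key="Entry ID"):
    """Return the normalized rows whose ID occurs in the non-homologous rows."""
    wanted_ids = {row[id_key] for row in non_homologous_rows}
    return [row for row in normalized_rows if row[id_key] in wanted_ids]


# Toy stand-ins for the two tables (made-up IDs, shortened feature vectors).
non_homologous = [{"Entry ID": "A0A000"}, {"Entry ID": "A0A002"}]
normalized = [
    {"Entry ID": "A0A000", "features": [0.1, 0.2]},
    {"Entry ID": "A0A001", "features": [0.3, 0.4]},
    {"Entry ID": "A0A002", "features": [0.5, 0.6]},
]

subset = select_by_entry_id(non_homologous, normalized)
```

    In practice the rows would be read from the xlsx files; refer to the authors' GitHub for the actual file layout.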

  3. ExPASy Biochemical Pathways

    • neuinfo.org
    • scicrunch.org
    • +1 more
    Updated Jan 29, 2022
    Cite
    (2022). ExPASy Biochemical Pathways [Dataset]. http://identifiers.org/RRID:SCR_007944
    Dataset updated: Jan 29, 2022
    Description

    The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE. It is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include format and content enhancements, cross-references to additional databases, new documentation files and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT.

  4. TemStaPro Datasets

    • zenodo.org
    application/gzip, tsv
    Updated Apr 9, 2024
    Cite
    Ieva Pudžiuvelytė; Kliment Olechnovič; Egle Godliauskaite; Kristupas Sermokas; Tomas Urbaitis; Giedrius Gasiunas; Darius Kazlauskas (2024). TemStaPro Datasets [Dataset]. http://doi.org/10.5281/zenodo.10463156
    Available download formats: application/gzip, tsv
    Dataset updated: Apr 9, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Ieva Pudžiuvelytė; Kliment Olechnovič; Egle Godliauskaite; Kristupas Sermokas; Tomas Urbaitis; Giedrius Gasiunas; Darius Kazlauskas
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered: Jan 14, 2024
    Description

    This dataset contains protein sequences used to train, validate, and test the binary classifiers that form the TemStaPro program, which predicts protein thermostability with respect to nine temperature thresholds from 40 to 80 degrees Celsius in steps of five degrees.

    The data are given in FASTA files. Each protein sequence has a header made of three values separated by vertical bar symbols: the UniParc taxonomy identifier of the organism to which the protein belongs; the UniProtKB/TrEMBL identifier of the protein sequence; and the organism's growth temperature, taken from the dataset of growth temperatures of over 21 thousand organisms (Engqvist, 2018).
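    As a rough sketch of the header layout and thresholds described above, the snippet below splits a three-field header and lists the nine thresholds. The field names, the example header, and the labelling rule (positive when the growth temperature is at or above the threshold) are assumptions for illustration, not taken from the TemStaPro code.

```python
# Nine thresholds from 40 to 80 degrees Celsius in steps of five.
THRESHOLDS_C = list(range(40, 85, 5))


def parse_header(header_line):
    """Split a 'taxonomy|protein|temperature' FASTA header into named fields."""
    taxonomy_id, protein_id, growth_temp = header_line.lstrip(">").strip().split("|")
    return {
        "taxonomy_id": taxonomy_id,
        "protein_id": protein_id,
        "growth_temp_c": float(growth_temp),
    }


def threshold_labels(growth_temp_c):
    """ASSUMPTION: a sequence is positive when growth temperature >= threshold."""
    return {t: growth_temp_c >= t for t in THRESHOLDS_C}


# Hypothetical header, invented for this example.
record = parse_header(">2287|A0A0A0EXAMPLE|80")
labels = threshold_labels(record["growth_temp_c"])
```

    Each binary classifier would then correspond to one key of `labels`; the actual class definition should be checked against the paper.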

    TemStaPro-Major-30 set is composed of 12 files:

    • one training
    • one validation
    • one imbalanced testing
    • nine balanced samples of 2000 sequences from each of the balanced testing sets

    TemStaPro-Minor-30 set is composed of cross-validation and testing files, all balanced for the 65 degrees Celsius temperature threshold.

    The SupplementaryFileC2EPsPredictions.tsv file contains thermostability predictions made with the default mode of the TemStaPro program to check the thermostability of different C2EP groups.

    The detailed description is given in the revised version of the corresponding paper (https://doi.org/10.1093/bioinformatics/btae157).

    If you use the data from this dataset, please cite both the paper and the DOI of the dataset.

  5. Diamond annotation

    • figshare.com
    txt
    Updated Oct 30, 2020
    Cite
    celine noirot (2020). Diamond annotation [Dataset]. http://doi.org/10.6084/m9.figshare.12320735.v1
    Available download formats: txt
    Dataset updated: Oct 30, 2020
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: celine noirot
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Annotation list for the Calanus finmarchicus reference transcriptome using DIAMOND [3].

    Contigs were aligned with DIAMOND [3] on NR (2019-09-29), Swissprot and Trembl (2018-12) to retrieve the corresponding best annotations.

    An annotation matrix was then generated by selecting the best hit for each database if: i) the percentage of the query length covered by the alignment was higher than 60%; ii) the percentage of the subject length covered by the alignment was higher than 40%; iii) the percentage identity of the alignment was higher than 40%.

    • File diamond_annotation_206k.tsv is the annotation list for the Calanus finmarchicus reference transcriptome using DIAMOND [3] (36,274 contigs with an annotation in at least one database out of the 206,012).
    • File diamond_annotation_76k.tsv is the annotation list for the 76,550 contigs expressed with more than 1 CPM in the RNA sequencing Bioproject PRJNA628886 using DIAMOND [3] (22,527 contigs with an annotation in at least one database out of the 76,550).

    [3] DIAMOND: version v0.9.22, parameters: -f 6 qseqid qlen qcovhsp pident score evalue length sseqid slen stitle. Ref: Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
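    The three filtering criteria can be sketched against the tabular columns listed in the note above. This is not the author's script: subject coverage is approximated here as alignment length divided by subject length, "best hit" is taken to mean the highest score, and the sample rows are invented.

```python
import csv
import io

# Column order taken from the DIAMOND "-f 6" parameters in the note above.
COLUMNS = ["qseqid", "qlen", "qcovhsp", "pident", "score",
           "evalue", "length", "sseqid", "slen", "stitle"]


def passes_filters(hit):
    """Apply the three thresholds: query cov > 60, subject cov > 40, identity > 40."""
    query_cov = float(hit["qcovhsp"])
    # ASSUMPTION: subject coverage approximated as alignment length / subject length.
    subject_cov = 100.0 * int(hit["length"]) / int(hit["slen"])
    return query_cov > 60 and subject_cov > 40 and float(hit["pident"]) > 40


def best_hits(tsv_text):
    """Return the highest-scoring passing hit per query (ASSUMPTION: best = max score)."""
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    for hit in reader:
        if not passes_filters(hit):
            continue
        current = best.get(hit["qseqid"])
        if current is None or float(hit["score"]) > float(current["score"]):
            best[hit["qseqid"]] = hit
    return best


# Invented rows: contig1 has two passing hits, contig2 fails query coverage.
SAMPLE = "\n".join([
    "contig1\t900\t75.0\t55.2\t410\t1e-50\t225\tP1\t300\tprotein A",
    "contig1\t900\t80.0\t62.0\t520\t1e-70\t240\tP2\t310\tprotein B",
    "contig2\t600\t50.0\t90.0\t300\t1e-30\t100\tP3\t120\tprotein C",
])
annotations = best_hits(SAMPLE)
```

    Running the same selection once per database (NR, Swissprot, Trembl) and joining on the contig ID would give one column group per database, in the spirit of the annotation matrix described above.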

    Related to bioproject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA628886

  6. DataSheet1_A Baseline Evaluation of Bioinformatics Capacity in Tanzania...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Raphael Zozimus Sangeda; Aneth David Mwakilili; Upendo Masamu; Siana Nkya; Liberata Alexander Mwita; Deogracious Protas Massawe; Sylvester Leonard Lyantagaye; Julie Makani (2023). DataSheet1_A Baseline Evaluation of Bioinformatics Capacity in Tanzania Reveals Areas for Training.pdf [Dataset]. http://doi.org/10.3389/feduc.2021.665313.s001
    Available download formats: pdf
    Dataset updated: May 31, 2023
    Dataset provided by: Frontiers
    Authors: Raphael Zozimus Sangeda; Aneth David Mwakilili; Upendo Masamu; Siana Nkya; Liberata Alexander Mwita; Deogracious Protas Massawe; Sylvester Leonard Lyantagaye; Julie Makani
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered: Tanzania
    Description

    Due to insufficient human and infrastructure capacity to use novel genomics and bioinformatics technologies, Sub-Saharan African countries have not fully reaped the benefits of these technologies in health and other sectors. The main objective of this study was to map out the interest and capacity for conducting bioinformatics and related research in Tanzania. The survey collected demographic information such as age group, experience, seniority level, gender, number of respondents per institution, number of publications, and willingness to join the community of practice. The survey also investigated the capacity of individuals and institutions regarding computing infrastructure, operating system use, statistical packages in use, basic Microsoft packages experience, programming language experience, bioinformatics tools and resources usage, and type of analyses performed. Moreover, respondents were surveyed about the challenges they faced in implementing bioinformatics and their willingness to join the bioinformatics community of practice in Tanzania. Out of 84 respondents, 50 (59.5%) were males. More than half of the respondents, 44 (52.4%), were between 26 and 32 years old. The majority, 41 (48.8%), were master’s degree holders with at least one publication related to bioinformatics. Eighty (95.2%) were willing to join the bioinformatics network and initiative in Tanzania. The major challenge faced by 22 (26.2%) respondents was the lack of training and skills. The most used resources for bioinformatics analyses were BLAST, PubMed, and GenBank. The analyses most often performed were sequence alignment and phylogenetics, reported by 57 (67.9%) and 42 (50%) of the respondents, respectively. The most frequently used statistical software packages were SPSS and R. A quarter of the respondents were conversant with computer programming.

    Early career and young scientists were the largest groups of responders engaged in bioinformatics research and activities across surveyed institutions in Tanzania. The use of bioinformatics tools for analysis is still low, including basic analysis tools such as BLAST, GenBank, sequence alignment software, Swiss-Prot and TrEMBL. There is also poor access to resources and tools for bioinformatics analyses. To address the skills and resources gaps, we recommend various modes of training and capacity building of relevant bioinformatics skills and infrastructure to improve bioinformatics capacity in Tanzania.

  7. Structural models of Cheiracanthium punctorium spider toxins with putative...

    • search.dataone.org
    • datadryad.org
    Updated Oct 7, 2025
    Cite
    Tim Lüddecke; Josephine Dresler; Sabine Hurka; Volker Herzig (2025). Structural models of Cheiracanthium punctorium spider toxins with putative defensive function (CSTX-type and phospholipase A2) [Dataset]. http://doi.org/10.5061/dryad.fn2z34v7t
    Dataset updated: Oct 7, 2025
    Dataset provided by: Dryad Digital Repository
    Authors: Tim Lüddecke; Josephine Dresler; Sabine Hurka; Volker Herzig
    Description

    Spider venom is an important but evolutionarily understudied functional trait. In this study, we have sequenced multiple venom gland transcriptomes from various spiders and analyzed their venom profile. The transcriptomes were generated via Illumina TruSeq RNA Sample Prep Kit v2 or TruSeq Stranded mRNA Library Prep Kit (paired-end, 151-bp read length), sequenced by Macrogen (Korea) using different Illumina chemistries, and assembled de novo using a pipeline incorporating Trinity v2.13.2/2.15.1 as well as rnaSPAdes v3.15.5. Identified toxin precursors were annotated using InterProScan v5.61-93.0/5.69-101.0, and a DIAMOND v2.0.15/2.1.9 blastp search against the publicly available databases VenomZone, UniProtKB/Swiss-Prot Tox-Prot, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL v2022_05/2024_04 was performed. Our analysis revealed the presence of novel double-domain toxins from the CSTX family and some phospholipase A2 toxins sequence-wise resembling homologs from honeybees and scorpions, in...

    RNA extraction and sequencing were outsourced to Macrogen. Following RNA extraction, libraries were constructed using the Illumina TruSeq RNA Sample Prep Kit v2 or TruSeq Stranded mRNA Library Prep Kit (paired-end, 151-bp read length). Quality was controlled by the verification of PCR-enriched fragment sizes using an Agilent Technologies 2100 Bioanalyzer with a DNA 1000 chip. The library quantity was determined by qPCR using the rapid library standard quantification solution and calculator (Roche). Transcriptome data were processed using a modified version of our in-house assembly and annotation pipeline62. All input sequences were inspected using FastQC v0.11.9/0.12.1 (www.bioinformatics.babraham.ac.uk) before trimming with cutadapt v4.2/4.9. The trimmed reads were corrected using Rcorrector v1.0.5/1.0.779 and assembled de novo using a pipeline incorporating Trinity v2.13.2/2.15.1 with a minimum contig size of 30 bp and maximum read normalization of 50 and rnaSPAdes v3.15.5 with and wi...

    # Structural models of Cheiracanthium punctorium spider toxins with putative defensive function (CSTX-type and phospholipase A2)

    Dataset DOI: 10.5061/dryad.fn2z34v7t

    Description of the data and file structure

    To grasp the architecture of double-domain CSTX toxins, and to understand whether the sequence-level similarity suggested for phospholipase A2 toxins from the nurse's thorn finger (Cheiracanthium punctorium) is reflected in similarities in folding, structure and, hence, potentially function, we employed structural modelling of the sequenced transcripts. We modelled selected CSTX toxins (CPTX1a, CPTX2d, CPTX5a and CPTX5d) as well as identified phospholipase A2 toxins from C. punctorium (CPTX14a) and their most similar homologs from bees and scorpions (P00630 and Q6T178). The models (used to generate figures 3 and 5 of the linked publication/preprint) were created with AlphaFold 3 from sequence data under default parameters.

    Files a...,

  8. MebiPred predictions for UniProt - Jul 2023

    • figshare.com
    application/gzip
    Updated Sep 7, 2023
    Cite
    R Prabakaran (2023). MebiPred predictions for UniProt - Jul 2023 [Dataset]. http://doi.org/10.6084/m9.figshare.24101013.v1
    Available download formats: application/gzip
    Dataset updated: Sep 7, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: R Prabakaran
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files contain the MeBiPred predictions for all the proteins (>248M) listed in UniProtKB - SwissProt and TrEMBL. MeBiPred could be used to screen proteins that bind to metals (Na, K, Mg, Ca, Mn, Fe, Co, Ni, Cu, Zn).

    Related links:
    • Publication: https://doi.org/10.1093/bioinformatics/btac358
    • Web App: https://services.bromberglab.org/mebipred/home
    • Python Package: https://pypi.org/project/mymetal/

  9. LegumeIP

    • dknet.org
    • scicrunch.org
    Updated Oct 13, 2025
    Cite
    (2025). LegumeIP [Dataset]. http://identifiers.org/RRID:SCR_008906/resolver/mentions?q=&i=rrid
    Dataset updated: Oct 13, 2025
    Description

    LegumeIP is an integrative database and bioinformatics platform for comparative genomics and transcriptomics to facilitate the study of gene function and genome evolution in legumes, and ultimately to generate molecular-based breeding tools to improve the quality of crop legumes. LegumeIP currently hosts large-scale genomics and transcriptomics data, including:

    • Genomic sequences of three model legumes, i.e. Medicago truncatula, Glycine max (soybean) and Lotus japonicus, together with two reference plant species, Arabidopsis thaliana and Populus trichocarpa, with the annotation based on UniProt TrEMBL, InterProScan, Gene Ontology and KEGG databases. LegumeIP covers a total of 222,217 protein-coding gene sequences.
    • Large-scale gene expression data compiled from 104 array hybridizations from L. japonicus, 156 array hybridizations from the M. truncatula gene atlas database, and 14 RNA-Seq-based gene expression profiles from G. max on different tissues, including four common tissues: Nodule, Flower, Root and Leaf.
    • Systematic synteny analysis among M. truncatula, G. max, L. japonicus and A. thaliana.
    • Reconstruction of gene families and gene family-wide phylogenetic analysis across the five hosted species.

    LegumeIP features comprehensive search and visualization tools to enable flexible queries on gene annotation, gene family, synteny, and relative abundance of gene expression.

  10. De Novo Characterization of the Mung Bean Transcriptome and Transcriptomic...

    • plos.figshare.com
    tiff
    Updated Jun 6, 2023
    Cite
    Shi-Weng Li; Rui-Fang Shi; Yan Leng (2023). De Novo Characterization of the Mung Bean Transcriptome and Transcriptomic Analysis of Adventitious Rooting in Seedlings Using RNA-Seq [Dataset]. http://doi.org/10.1371/journal.pone.0132969
    Available download formats: tiff
    Dataset updated: Jun 6, 2023
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Shi-Weng Li; Rui-Fang Shi; Yan Leng
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adventitious rooting is the most important mechanism underlying vegetative propagation and an important strategy for plant propagation under environmental stress. The present study was conducted to obtain transcriptomic data and examine gene expression using RNA-Seq and bioinformatics analysis, thereby providing a foundation for understanding the molecular mechanisms controlling adventitious rooting. Three cDNA libraries constructed from mRNA samples from mung bean hypocotyls during adventitious rooting were sequenced. These three samples generated a total of 73 million, 60 million, and 59 million 100-bp reads, respectively. These reads were assembled into 78,697 unigenes with an average length of 832 bp, totaling 65 Mb. The unigenes were aligned against six public protein databases, and 29,029 unigenes (36.77%) were annotated using BLASTx. Among them, 28,225 (35.75%) and 28,119 (35.62%) unigenes had homologs in the TrEMBL and NCBI non-redundant (Nr) databases, respectively. Of these unigenes, 21,140 were assigned to gene ontology classes, and a total of 11,990 unigenes were classified into 25 KOG functional categories. A total of 7,357 unigenes were annotated to 4,524 KOs, and 4,651 unigenes were mapped onto 342 KEGG pathways using BLAST comparison against the KEGG database. A total of 11,717 unigenes were differentially expressed (fold change>2) during the root induction stage, with 8,772 unigenes down-regulated and 2,945 unigenes up-regulated. A total of 12,737 unigenes were differentially expressed during the root initiation stage, with 9,303 unigenes down-regulated and 3,434 unigenes up-regulated. A total of 5,334 unigenes were differentially expressed between the root induction and initiation stage, with 2,167 unigenes down-regulated and 3,167 unigenes up-regulated. qRT-PCR validation of the 39 genes with known functions indicated a strong correlation (92.3%) with the RNA-Seq data. 
The GO enrichment, pathway mapping, and gene expression profiles reveal molecular traits for root induction and initiation. This study provides a platform for functional genomic research with this species.
