42 datasets found

uniref30
huggingface.co
Updated Jan 17, 2025
Cite
Ahmed Elnaggar (2025). uniref30 [Dataset]. https://huggingface.co/datasets/agemagician/uniref30
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 17, 2025
Authors
Ahmed Elnaggar
Description
agemagician/uniref30 dataset hosted on Hugging Face and contributed by the HF Datasets community
UniRef
dknet.org
scicrunch.org
Updated Aug 9, 2024
Cite
(2024). UniRef [Dataset]. http://identifiers.org/RRID:SCR_010646
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_010646
Dataset updated
Aug 9, 2024
Description
Databases which provide clustered sets of sequences from UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
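The UniRef100 merging rule described above can be illustrated with a toy sketch. This is not the UniProt production pipeline: the function name merge_identical_and_fragments, the example IDs and sequences, and the reading that fragments shorter than 11 residues are merged only when they are identical to a representative are all assumptions made for illustration.

```python
def merge_identical_and_fragments(records, min_fragment_len=11):
    """Group sequences under the longest sequence that contains them (toy rule)."""
    clusters = {}  # representative sequence -> member IDs
    # Visit the longest sequences first so a representative exists before its fragments.
    for seq_id, seq in sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        for rep, members in clusters.items():
            identical = seq == rep
            fragment = len(seq) >= min_fragment_len and seq in rep
            if identical or fragment:
                members.append(seq_id)
                break
        else:
            # No representative absorbed this sequence, so it starts its own entry.
            clusters[seq] = [seq_id]
    return clusters


# Made-up records: B is identical to A, C is a 12-residue fragment of A,
# D is a 5-residue fragment (too short to merge under the assumed rule).
records = {
    "A": "MKTAYIAKQRQISFVKSHFSRQ",
    "B": "MKTAYIAKQRQISFVKSHFSRQ",
    "C": "AKQRQISFVKSH",
    "D": "AKQRQ",
    "E": "GGGGGGGGGGGG",
}
clusters = merge_identical_and_fragments(records)
for rep, members in clusters.items():
    print(rep, members)
```

Sorting by decreasing length keeps the sketch simple: every fragment is compared only against sequences at least as long as itself.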
UniRef
registry.identifiers.org
bioregistry.io
Updated Mar 27, 2023
Cite
(2023). UniRef [Dataset]. https://registry.identifiers.org/registry/uniref#!
Explore at:
Dataset updated
Mar 27, 2023
Description
The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (including isoforms) and selected UniParc records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
UniRef at the EBI
neuinfo.org
scicrunch.org
Updated Jun 3, 2024
Cite
(2024). UniRef at the EBI [Dataset]. http://identifiers.org/RRID:SCR_004972
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_004972
Dataset updated
Jun 3, 2024
Description
Various non-redundant databases with different sequence identity cut-offs created by clustering closely similar sequences to yield a representative subset of sequences. In the UniRef90 and UniRef50 databases no pair of sequences in the representative set has >90% or >50% mutual sequence identity. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases. The two major objectives of UniRef are: (i) to facilitate sequence merging in UniProt, and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains a certain level of redundancy because it is not possible to use fully automatic merging without risking merging of similar sequences from different proteins. However, such automatic procedures are extremely useful in compiling the UniRef databases to obtain complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy results in several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. Thus, a more even sampling of sequence space is advantageous. You may access NREF via the FTP server.
List of UniRef90 clusters that include mammals and dsDNA viruses (Class I).
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Nadav Rappoport; Michal Linial (2023). List of UniRef90 clusters that include mammals and dsDNA viruses (Class I). [Dataset]. http://doi.org/10.1371/journal.pcbi.1002364.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1002364.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Nadav Rappoport; Michal Linial
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
a: Bac, cluster is mixed with bacterial proteins.
b: Length of cluster's seed protein.
c: Analysis is based on phylogenetic tree and analyzing the expanded cluster according to UniRef50.
d: H2V, from host to virus, i.e., sequences acquired by the virus from a metazoan host.
N.D., unresolved; Cont, contamination; Frag, fragment.
UniRef
rrid.site
Updated Jul 12, 2025
Cite
The citation is currently not available for this dataset.
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_010646
Dataset updated
Jul 12, 2025
Description
Databases which provide clustered sets of sequences from UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
UniRef90_len_0_50
huggingface.co
Cite
Zhenjiao Du, UniRef90_len_0_50 [Dataset]. https://huggingface.co/datasets/dzjxzyd/UniRef90_len_0_50
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Authors
Zhenjiao Du
Description
This dataset was downloaded from the UniRef90 database and contains sequences with lengths ranging from 0 to 50.

Code used for the data mining (downloaded on September 30, 2024):

import requests
query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29+AND+%28identity%3A0.9%29'
uniprot_request = requests.get(query_url)
from io import BytesIO
import pandas

bio = BytesIO(uniprot_request.content)

df =… See the full description on the dataset page: https://huggingface.co/datasets/dzjxzyd/UniRef90_len_0_50.
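The snippet above is cut off at the point where the response is loaded into a table. The following is a minimal standard-library sketch of the two steps it implies: building the query URL and reading a gzip-compressed TSV payload. It is not the dataset author's code. The helper names build_query_url and parse_compressed_tsv are hypothetical, and the sample payload (column headers and the single record) is made up so the example runs offline.

```python
import csv
import gzip
import io
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the query URL in the description above.
UNIREF_STREAM = "https://rest.uniprot.org/uniref/stream"


def build_query_url(max_length=50, identity=0.9):
    """Assemble the stream URL from its parts instead of hand-encoding it."""
    params = {
        "compressed": "true",
        "fields": "id,length,identity,sequence",
        "format": "tsv",
        "query": f"((length:[* TO {max_length}])) AND (identity:{identity})",
    }
    return f"{UNIREF_STREAM}?{urlencode(params)}"


def parse_compressed_tsv(content):
    """Decompress gzip bytes and return one dict per TSV row."""
    text = gzip.decompress(content).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


url = build_query_url()

# Made-up payload standing in for the HTTP response body (no network access needed).
sample = gzip.compress(
    b"Cluster ID\tLength\tIdentity\tSequence\n"
    b"UniRef90_EXAMPLE\t5\t0.9\tMKTAY\n"
)
rows = parse_compressed_tsv(sample)
print(url)
print(rows)
```

With a real response, the same parser would be called on the response body bytes; the actual column headers returned by the service may differ from the ones assumed here.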
CAT/BAT uniref90+algae proteins from NCBI
figshare.unimelb.edu.au
bin
Updated Dec 9, 2024
Cite
Yuhao Tong (2024). CAT/BAT uniref90+algae proteins from NCBI [Dataset]. http://doi.org/10.26188/27990278.v2
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.26188/27990278.v2
Dataset updated
Dec 9, 2024
Dataset provided by
The University of Melbourne
Authors
Yuhao Tong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
description
ProteinBERT Trained model
dataone.org
Updated Dec 16, 2023
Cite
Ofer, Dan; Brandes, Nadav (2023). ProteinBERT Trained model [Dataset]. http://doi.org/10.7910/DVN/HI55J5
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/HI55J5
Dataset updated
Dec 16, 2023
Dataset provided by
Harvard Dataverse
Authors
Ofer, Dan; Brandes, Nadav
Description
Trained ProteinBERT model weights for the paper "ProteinBERT: A universal deep-learning model of protein sequence and function". https://github.com/nadavbra/protein_bert Also available via FTP: ftp://ftp.cs.huji.ac.il/users/nadavb/protein_bert/epoch_92400_sample_23500000.pkl ProteinBERT is a protein language model pretrained on ~106M proteins from UniRef90. The pretrained model can be fine-tuned on any protein-related task in a matter of minutes. ProteinBERT achieves state-of-the-art performance on a wide range of benchmarks. ProteinBERT is built on Keras/TensorFlow. Its deep-learning architecture is inspired by BERT but contains several innovations, such as global-attention layers whose cost grows linearly with sequence length (compared to the quadratic, n^2, growth of self-attention). As a result, the model can process protein sequences of almost any length, including extremely long ones (tens of thousands of amino acids). The model takes protein sequences as inputs and can also take protein GO annotations as additional inputs, which help it infer the function of the input protein and update its internal representations and outputs accordingly. This pretrained Keras/TensorFlow model was produced by training for 28 days over ~670M records (~6.4 epochs over the entire UniRef90 training dataset of ~106M proteins).
TIGR Plant Transcript Assembly database
neuinfo.org
dknet.org
Updated Jun 20, 2024
Cite
(2024). TIGR Plant Transcript Assembly database [Dataset]. http://identifiers.org/RRID:SCR_005470
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005470
Dataset updated
Jun 20, 2024
Description
The TIGR database is a collection of plant transcript sequences. Transcript assemblies are searchable using BLAST and accession number. The construction of plant transcript assemblies (TAs) is similar to the TIGR gene indices. The sequences that are used to build the plant TAs are expressed transcripts collected from dbEST (ESTs) and the NCBI GenBank nucleotide database (full-length and partial cDNAs). "Virtual" transcript sequences derived from whole genome annotation projects are not included. All plant species for which more than 1,000 ESTs or cDNA sequences are available are included in this project. TAs are clustered and assembled using the TGICL tool (Pertea et al., 2003), Megablast (Zhang et al., 2000) and the CAP3 assembler (Huang and Madan, 1999). TGICL is a wrapper script which invokes Megablast and CAP3. Sequences are initially clustered based on an all-against-all comparison using Megablast. The initial clusters are assembled to generate consensus sequences using CAP3. Assembly criteria include a 50 bp minimum match, 95% minimum identity in the overlap region and 20 bp maximum unmatched overhangs. Any EST/cDNA sequences that are not assembled into TAs are included as singletons. All singletons retain their GenBank accession numbers as identifiers. Plant TA identifiers are of the form TAnumber_taxonID, where number is a unique numerical identifier of the transcript assembly and taxonID represents the NCBI taxon id. In order to provide annotation for the TAs, each TA/singleton was aligned to the UniProt UniRef database. For release 1 TAs, a masked version of the UniRef90 database was used. For release 2 and onwards, a masked version of the UniRef100 database is used. Alignments were required to have at least 20% identity and 20% coverage. The annotation for the protein with the best alignment to each TA or singleton was used as the annotation for that sequence.
Additionally, the relative orientation of each TA/singleton to the best matching protein sequence was used to determine the orientation of each TA/singleton. Some sequences did not have alignments to the protein database that met our quality criteria, and those sequences have neither annotation nor orientation assignments. The release number for the plant TAs refers to the release version for a particular species. For the initial build, all TA sets are of version 1. Subsequent TA updates for new releases will be carried out when the percentage increase of the EST and cDNA counts exceeds 10% of the previous release and when the increase contains more than 1,000 new sequences. New releases will also include additional plant species with more than 1,000 EST or cDNA sequences that have become publicly available.
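The identifier format, the annotation thresholds, and the release rule stated in this description can be written down as a short sketch. The function names and the example identifier are hypothetical; only the TAnumber_taxonID form, the 20% identity and coverage cut-offs, and the "more than 10% and more than 1,000 new sequences" rule come from the text above.

```python
import re

# TAnumber_taxonID, e.g. a made-up identifier such as "TA42_3702".
TA_PATTERN = re.compile(r"^TA(\d+)_(\d+)$")


def parse_ta_identifier(identifier):
    """Split a plant TA identifier into (assembly number, NCBI taxon id)."""
    match = TA_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a TA identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


def passes_annotation_filter(percent_identity, percent_coverage,
                             min_identity=20.0, min_coverage=20.0):
    """Alignments need at least 20% identity and at least 20% coverage."""
    return percent_identity >= min_identity and percent_coverage >= min_coverage


def needs_new_release(previous_count, current_count):
    """A new release requires >10% growth and >1,000 new EST/cDNA sequences."""
    added = current_count - previous_count
    return added > 1000 and added > 0.10 * previous_count


print(parse_ta_identifier("TA42_3702"))
print(passes_annotation_filter(25.0, 19.9))
print(needs_new_release(5000, 6200))
```

Singletons keep their GenBank accession numbers, so parse_ta_identifier is expected to reject them rather than guess a taxon id.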
Number of protein sequences in UniRef100 database and its variants.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). Number of protein sequences in UniRef100 database and its variants. [Dataset]. http://doi.org/10.1371/journal.pone.0158445.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0158445.t002
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of protein sequences in UniRef100 database and its variants.
Platon RDS dataset
explore.openaire.eu
Updated Apr 21, 2020
Cite
Oliver Schwengers; Patrick Barth; Linda Falgenhauer; Torsten Hain; Trinad Chakraborty; Alexander Goesmann (2020). Platon RDS dataset [Dataset]. http://doi.org/10.5281/zenodo.3759169
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3759169
Dataset updated
Apr 21, 2020
Authors
Oliver Schwengers; Patrick Barth; Linda Falgenhauer; Torsten Hain; Trinad Chakraborty; Alexander Goesmann
Description
This dataset was used in the Platon manuscript/publication. It comprises:

• chromosome sequences
• plasmid sequences
• UniRef90 bacterial representative protein sequences
• UniRef90 protein / chromosome & plasmid hit counts
• artificial contigs
• RDS threshold metrics
FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues...
plos.figshare.com
docx
Updated May 31, 2023
Cite
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues [Dataset]. http://doi.org/10.1371/journal.pone.0158445
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0158445
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses, are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed to generate PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data, as well as the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission.
Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.
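The random-sampling approach mentioned in this abstract (keeping 1% of the sequences, chosen uniformly at random) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name sample_fraction, the fixed seed, and the made-up records are hypothetical, and a real UniRef100 file would be streamed rather than held in a list.

```python
import random


def sample_fraction(records, fraction=0.01, seed=0):
    """Return a uniform random subset holding about `fraction` of the records."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not records:
        return []
    # Keep at least one record so tiny inputs still yield a usable database.
    k = max(1, round(len(records) * fraction))
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    return rng.sample(records, k)


# Made-up (id, sequence) records standing in for a FASTA file.
records = [(f"seq{i}", "MKTAYIAKQR") for i in range(1000)]
subset = sample_fraction(records, fraction=0.01, seed=42)
print(len(subset))
```

Sampling without replacement ensures that no sequence appears twice in the reduced database.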
CAMI2_FunctionalAnnotation_BenchmarkSet
zenodo.org
bin
Updated Apr 11, 2025
Cite
Jonathan Turck (2025). CAMI2_FunctionalAnnotation_BenchmarkSet [Dataset]. http://doi.org/10.5281/zenodo.15192200
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15192200
Dataset updated
Apr 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jonathan Turck
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Paired protein sequences in FASTA format and enzyme commission (EC) labels, generated from the CAMI 2 Toy Human Microbiome Project gold standard assemblies using Prodigal and DIAMOND against the UniRef90 database.
UniProt
registry.opendata.aws
Updated Apr 6, 2021
Cite
SIB Swiss Institute of Bioinformatics on behalf of the UniProt Consortium (2021). UniProt [Dataset]. https://registry.opendata.aws/uniprot/
Explore at:
Dataset updated
Apr 6, 2021
Dataset provided by
UniProt (http://www.uniprot.org/)
Description
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.
Metaclusters by DPCfam clustering of UniRef50 v 2017_07
data.niaid.nih.gov
zenodo.org
Updated Oct 30, 2022
Cite
Elena Tea Russo (2022). Metaclusters by DPCfam clustering of UniRef50 v 2017_07 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5877585
Explore at:
Dataset updated
Oct 30, 2022
Dataset provided by
Elena Tea Russo
Federico Barone
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07. Metaclusters represent putative protein families automatically derived using the DPCfam method, as described in Unsupervised protein family classification by Density Peak clustering, Russo ET, 2020, PhD Thesis http://hdl.handle.net/20.500.11767/116345 . Supervisors: Alessandro Laio, Marco Punta.

Visit also https://dpcfam.areasciencepark.it/ to easily navigate the data.

VERSION 1.1 changes:

Added DPCfamB database, including all small metaclusters with 25<=N<50 seed sequences. DPCfamB files are named with the prefix B_

Added AlphaFold representative based on AlphaFoldDB for each MC

FILES DESCRIPTION:

1) Standard DPCfam database

metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file describing the data is included. A parser is included to transform XML data to space-separated tables. An XML schema is included.

metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.

metaclusters_hmms.tar.gz Metaclusters' profile-HMMs. One ".hmm" file for each metacluster. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.

all_metaclusters_hmm.tar.gz Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.

uniref50_annotated.xml.gz UniRef50 v.2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included. XML schema is derived from uniprot's UniRef50 xml schema.

2) DPCfamB database

B_metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file describing the data is included. A parser is included to transform XML data to space-separated tables. An XML schema is included.

B_metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs whose seeds have 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.

B_metaclusters_hmms.tar.gz Metaclusters' profile-HMMs. One ".hmm" file for each metacluster. Only MCs whose seeds have 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.

B_all_metaclusters_hmm.tar.gz Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs whose seeds have 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
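The size and length criteria that decide which MSA/HMM archive a metacluster appears in can be summarized in a small sketch. The function name msa_hmm_archive and its return labels are hypothetical. The sketch follows the wording of the file descriptions literally, so a metacluster with exactly 50 seed sequences matches neither range; whether the actual release handles that case differently is not stated here.

```python
def msa_hmm_archive(n_seeds, avg_length):
    """Return which MSA/HMM archive family a metacluster would fall into.

    n_seeds    -- number of seed sequences (N in the description)
    avg_length -- average seed length in amino acids
    """
    if avg_length <= 50:
        return None  # both families require an average length larger than 50
    if n_seeds > 50:
        return "standard"  # metaclusters_msas / metaclusters_hmms
    if 25 <= n_seeds < 50:
        return "B"  # B_metaclusters_msas / B_metaclusters_hmms
    return None  # fewer than 25 seeds, or exactly 50 under the literal wording


for n, length in [(120, 80.0), (30, 80.0), (50, 80.0), (120, 40.0)]:
    print(n, length, msa_hmm_archive(n, length))
```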
uniref90_parquets_shuffled
huggingface.co
Updated Mar 30, 2025
Cite
Jude Wells (2025). uniref90_parquets_shuffled [Dataset]. https://huggingface.co/datasets/judewells/uniref90_parquets_shuffled
Explore at:
Dataset updated
Mar 30, 2025
Authors
Jude Wells
Description
UniRef90 parquet files created by Jude Wells on 2025-03-18. See the script data_creation_scripts/shuffling/shuffle_uniref90.sh for how the shuffling was done. Processing was done on the kaspian computer.
Average number of hits used for generating PSSM profiles.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). Average number of hits used for generating PSSM profiles. [Dataset]. http://doi.org/10.1371/journal.pone.0158445.t007
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0158445.t007
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Average number of hits used for generating PSSM profiles.
Original paired-end transcriptome outputs from the RNAseq analyses of the...
data.mendeley.com
Updated Jul 22, 2024
Cite
Dany Domínguez Pérez (2024). Original paired-end transcriptome outputs from the RNAseq analyses of the false coral Savalia savaglia [Dataset]. http://doi.org/10.17632/7t36p2dvjp.1
Explore at:
Unique identifier
https://doi.org/10.17632/7t36p2dvjp.1
Dataset updated
Jul 22, 2024
Authors
Dany Domínguez Pérez
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains outputs generated from the original paired-end transcriptomic analyses of the false coral Savalia savaglia. The dataset includes the following files:

• 75282_ID2093_3-SAS_S416_L004_R1_001.fastq.P.qtrim.zip: Preprocessed forward reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.

• 75282_ID2093_3-SAS_S416_L004_R2_001.fastq.P.qtrim.zip: Preprocessed reverse reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.

• Assembly_Ss_PE.Trinity.fasta: Original, non-filtered de novo paired-end transcriptome assembly of Savalia savaglia, generated from 68 million PE reads using the Trinity Assembler.

• Assembly_Ss_PE_Trinity.fasta_stats.txt: Statistical summary of the paired-end transcriptome assembly of Savalia savaglia.

• Assembly_Ss_PE_Trinity.fasta.gene_trans_map: Transcript-to-gene mapping file generated during the paired-end transcriptome assembly of Savalia savaglia.

• quant_Ss_PE.sf: Salmon output containing expression values for the assembled transcripts from the paired-end assembly of Savalia savaglia.

Additionally, the dataset includes the following files from DIAMOND BLASTx analyses, which used the original de novo paired-end transcriptome assembly of the false coral Savalia savaglia:

• UniRef90_PE.diamond.blastx.outfmt6: BLASTx output file against the UniRef90 database, reporting the top alignment for each query (assembled transcripts) from the paired-end assembly of Savalia savaglia.

• UniRef90_PE.diamond.blastx.outfmt6.grouped: Grouped BLASTx hits from the paired-end assembly of Savalia savaglia, designed to improve sequence coverage by combining multiple high-scoring segment pairs (HSPs).

• UniRef90_PE.diamond.blastx.outfmt6.hist: Histogram summarizing the distribution of BLASTx hit lengths obtained from the paired-end assembly of Savalia savaglia.

• UniRef90_PE.diamond.blastx.outfmt6.w_pct_hit_length: File providing percentages of hit lengths from BLASTx analyses of the paired-end assembly of Savalia savaglia, including the top hit's length and the percent of the length covered in the alignment.
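For readers who want to inspect the .outfmt6 file, the sketch below parses tab-separated hits and keeps the highest-scoring hit per query. It assumes the file uses the default 12-column BLAST tabular layout (qseqid through bitscore), which the description does not state explicitly. The function names and the sample lines are made up for illustration.

```python
import csv
import io

# Default column order of BLAST/DIAMOND tabular output (format 6); assumed here.
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def read_outfmt6(handle):
    """Parse tab-separated hits into dicts with numeric fields converted."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        hits.append(hit)
    return hits


def best_hit_per_query(hits):
    """Keep the hit with the highest bit score for each query sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up example lines; real transcript and cluster IDs will differ.
sample = io.StringIO(
    "TRINITY_DN1_c0_g1_i1\tUniRef90_EXAMPLE1\t85.5\t200\t29\t0\t1\t600\t10\t209\t1e-100\t350.0\n"
    "TRINITY_DN1_c0_g1_i1\tUniRef90_EXAMPLE2\t60.0\t150\t60\t0\t1\t450\t5\t154\t1e-40\t160.0\n"
    "TRINITY_DN2_c0_g1_i1\tUniRef90_EXAMPLE3\t99.0\t50\t0\t0\t1\t150\t1\t50\t1e-20\t100.5\n"
)
hits = read_outfmt6(sample)
best = best_hit_per_query(hits)
for query, hit in best.items():
    print(query, hit["sseqid"], hit["bitscore"])
```

To read the real file, the same read_outfmt6 function can be given an open file handle instead of the in-memory sample.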
Number of interface and non-interface residues in RB198, RB44, and RB111...
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). Number of interface and non-interface residues in RB198, RB44, and RB111 datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0158445.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0158445.t001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of interface and non-interface residues in RB198, RB44, and RB111 datasets.