zpn/uniref90 dataset hosted on Hugging Face and contributed by the HF Datasets community
agemagician/uniref90 dataset hosted on Hugging Face and contributed by the HF Datasets community
Databases which provide clustered sets of sequences from UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
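The seed-based clustering described above can be illustrated with a minimal sketch. This is not the UniRef pipeline: the ungapped `toy_identity` measure and the function names are assumptions chosen for brevity, whereas real clustering relies on proper sequence alignment.

```python
def toy_identity(a, b):
    """Fraction of identical residues over the shorter sequence.

    Hypothetical stand-in for alignment-based identity: positions are
    compared from the start of both sequences, with no gaps.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first seed it matches at >= threshold.

    Sequences are visited longest first, so each seed is the longest
    member of its cluster, mirroring the 'seed sequence' idea above.
    """
    clusters = {}  # seed sequence -> list of member sequences
    for seq in sorted(sequences, key=len, reverse=True):
        for seed in clusters:
            if toy_identity(seed, seq) >= threshold:
                clusters[seed].append(seq)
                break
        else:
            # No existing seed was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters
```

Calling `greedy_cluster(seqs, 0.9)` and `greedy_cluster(seqs, 0.5)` on the same input mimics the two resolutions (UniRef90 and UniRef50) in miniature.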
The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (including isoforms) and selected UniParc records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
Various non-redundant databases with different sequence identity cut-offs created by clustering closely similar sequences to yield a representative subset of sequences. In the UniRef90 and UniRef50 databases no pair of sequences in the representative set has >90% or >50% mutual sequence identity. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases. The two major objectives of UniRef are: (i) to facilitate sequence merging in UniProt, and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains a certain level of redundancy because it is not possible to use fully automatic merging without risking merging of similar sequences from different proteins. However, such automatic procedures are extremely useful in compiling the UniRef databases to obtain complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy results in several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. Thus, a more even sampling of sequence space is advantageous. You may access NREF via the FTP server.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
a Bac: cluster is mixed with bacterial proteins.
b Length of the cluster's seed protein.
c Analysis is based on a phylogenetic tree and on analyzing the expanded cluster according to UniRef50.
d H2V: from host to virus, i.e., sequences acquired by the virus from a metazoan host.
N.D.: unresolved; Cont: contamination; Frag: fragment.
This dataset was created by Darien Schettler
All UniRef90 sequences of G-protein coupled receptor (GPCR) class proteins across all species. G-protein coupled receptors are evolutionarily related cell surface receptors that detect molecules outside the cell in eukaryotes. Contains both confirmed and putative proteins.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was downloaded from the UniRef50 database and contains sequences with lengths ranging from 0 to 50.
Code for the data mining (downloaded on September 30, 2024):
import requests
from io import BytesIO
import pandas
# UniProt REST query: UniRef clusters at 50% identity with sequence length up to 50
query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29+AND+%28identity%3A0.5%29'
uniprot_request = requests.get(query_url)
# Wrap the compressed response body in a file-like object
bio = BytesIO(uniprot_request.content)
df =… See the full description on the dataset page: https://huggingface.co/datasets/dzjxzyd/UniRef50_len_0_50.
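The original snippet is truncated, so the following is a separate, minimal sketch rather than the dataset author's code. It shows how the percent-encoded query string can be built from readable parameters and how a gzip-compressed TSV payload could be parsed using only the standard library. The function names and the column headers in the usage note are hypothetical.

```python
import csv
import gzip
import io
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/uniref/stream"


def build_query_url(max_length, identity):
    """Build a UniRef stream URL for clusters up to max_length residues."""
    params = {
        "compressed": "true",
        "fields": "id,length,identity,sequence",
        "format": "tsv",
        "query": f"((length:[* TO {max_length}])) AND (identity:{identity})",
    }
    # safe='*' keeps the wildcard literal; everything else is percent-encoded.
    return f"{BASE_URL}?{urlencode(params, safe='*')}"


def parse_gzip_tsv(raw):
    """Decompress gzip bytes and return the TSV rows as a list of dicts."""
    text = gzip.decompress(raw).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

For example, `build_query_url(50, 0.5)` reproduces the query URL quoted above, and `parse_gzip_tsv` can be exercised offline on any gzip-compressed, tab-separated bytes (for instance, a payload with made-up headers such as `Cluster ID` and `Length`).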
The TIGR database is a collection of plant transcript sequences. Transcript assemblies are searchable using BLAST and accession number. The construction of plant transcript assemblies (TAs) is similar to the TIGR gene indices. The sequences that are used to build the plant TAs are expressed transcripts collected from dbEST (ESTs) and the NCBI GenBank nucleotide database (full length and partial cDNAs). "Virtual" transcript sequences derived from whole genome annotation projects are not included. All plant species for which more than 1,000 ESTs or cDNA sequences are available are included in this project. TAs are clustered and assembled using the TGICL tool (Pertea et al., 2003), Megablast (Zhang et al., 2000) and the CAP3 assembler (Huang and Madan, 1999). TGICL is a wrapper script which invokes Megablast and CAP3. Sequences are initially clustered based on an all-against-all comparison using Megablast. The initial clusters are assembled to generate consensus sequences using CAP3. Assembly criteria include a 50 bp minimum match, 95% minimum identity in the overlap region and 20 bp maximum unmatched overhangs. Any EST/cDNA sequences that are not assembled into TAs are included as singletons. All singletons retain their GenBank accession numbers as identifiers. Plant TA identifiers are of the form TAnumber_taxonID, where number is a unique numerical identifier of the transcript assembly and taxonID represents the NCBI taxon id. In order to provide annotation for the TAs, each TA/singleton was aligned to the UniProt UniRef database. For release 1 TAs, a masked version of the UniRef90 database was used. For release 2 and onwards, a masked version of the UniRef100 database is used. Alignments were required to have at least 20% identity and 20% coverage. The annotation for the protein with the best alignment to each TA or singleton was used as the annotation for that sequence.
Additionally, the relative orientation of each TA/singleton to the best matching protein sequence was used to determine the orientation of each TA/singleton. Some sequences did not have alignments to the protein database that met our quality criteria, and those sequences have neither annotation nor orientation assignments. The release number for the plant TAs refers to the release version for a particular species. For the initial build, all TA sets are of version 1. Subsequent TA updates for new releases will be carried out when the percentage increase of the EST and cDNA counts exceeds 10% of the previous release and when the increase contains more than 1,000 new sequences. New releases will also include additional plant species with more than 1,000 EST or cDNA sequences that have become publicly available.
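Two of the rules stated above, the TAnumber_taxonID identifier form and the release-update condition, can be expressed as a short sketch. The regular expression assumes that both parts are plain decimal digits, and the function names are hypothetical; neither is taken from the TIGR tooling.

```python
import re

# Assumed shape: literal "TA", a numeric assembly id, "_", a numeric NCBI taxon id.
TA_PATTERN = re.compile(r"^TA(\d+)_(\d+)$")


def parse_ta_identifier(identifier):
    """Split a plant TA identifier into (assembly_number, taxon_id)."""
    match = TA_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a TA identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


def needs_new_release(previous_count, current_count):
    """Apply the stated update rule to EST/cDNA sequence counts.

    A new release is due when the increase exceeds 10% of the previous
    release AND amounts to more than 1,000 new sequences.
    """
    increase = current_count - previous_count
    return increase > 1000 and increase > 0.10 * previous_count
```

Under this reading, a species growing from 20,000 to 22,500 sequences qualifies for an update, while one growing from 50,000 to 52,000 does not, because the increase stays below 10%.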
Trained ProteinBERT model weights for the paper "ProteinBERT: A universal deep-learning model of protein sequence and function". https://github.com/nadavbra/protein_bert Also available via FTP: ftp://ftp.cs.huji.ac.il/users/nadavb/protein_bert/epoch_92400_sample_23500000.pkl ProteinBERT is a protein language model pretrained on ~106M proteins from UniRef90. The pretrained model can be fine-tuned on any protein-related task in a matter of minutes. ProteinBERT achieves state-of-the-art performance on a wide range of benchmarks. ProteinBERT is built on Keras/TensorFlow. ProteinBERT's deep-learning architecture is inspired by BERT, but contains several innovations such as global-attention layers that have linear complexity for sequence length (compared to self-attention's quadratic/n^2 growth). As a result, the model can process protein sequences of almost any length, including extremely long protein sequences (of over tens of thousands of amino acids). The model takes protein sequences as inputs, and can also take protein GO annotations as additional inputs (to help the model infer about the function of the input protein and update its internal representations and outputs accordingly). This pretrained Tensorflow/Keras model was produced by training for 28 days over ~670M records (~6.4 epochs over the entire UniRef90 training dataset of ~106M proteins).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the Platon manuscript/publication comprising:
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average number of hits used for generating PSSM profiles.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07. Metaclusters represent putative protein families automatically derived using the DPCfam method, as described in Unsupervised protein family classification by Density Peak clustering, Russo ET, 2020, PhD Thesis http://hdl.handle.net/20.500.11767/116345 . Supervisors: Alessandro Laio, Marco Punta.
Visit also https://dpcfam.areasciencepark.it/ to easily navigate the data.
VERSION 1.1 changes:
Added DPCfamB database, including all small metaclusters with 25<=N<50 seed sequences. DPCfamB files are named with the prefix B_
Added AlphaFold representative based on AlphaFoldDB for each MC
FILES DESCRIPTION:
1) Standard DPCfam database
metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a.s are reported. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included.
metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a.s are reported.
metaclusters_hmms.tar.gz Metaclusters' profile-HMMs. A ".hmm" file for each metacluster. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a.s are reported.
all_metaclusters_hmm.tar.gz Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a.s are reported.
uniref50_annotated.xml.gz UniRef50 v.2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included. The XML schema is derived from UniProt's UniRef50 XML schema.
2) DPCfamB database
B_metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included.
B_metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a.s are reported.
B_metaclusters_hmms.tar.gz Metaclusters' profile-HMMs. A ".hmm" file for each metacluster. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a.s are reported.
B_all_metaclusters_hmm.tar.gz Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a.s are reported.
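The inclusion criteria for the MSA and HMM files above can be written as a small decision function. This is a sketch that follows the stated thresholds literally; the function name and return labels are hypothetical. Note that, read literally, a metacluster with exactly 50 seeds falls in neither set.

```python
def dpcfam_database(n_seeds, avg_length):
    """Return which file set would report a metacluster, or None.

    Both sets require an average seed length larger than 50 amino acids.
    Standard DPCfam: more than 50 seeds. DPCfamB: 25 <= N < 50 seeds.
    """
    if avg_length <= 50:
        return None
    if n_seeds > 50:
        return "DPCfam"
    if 25 <= n_seeds < 50:
        return "DPCfamB"
    return None
```

For instance, `dpcfam_database(30, 80.0)` maps to the B_-prefixed files, while a metacluster of 120 short seeds (average length 40) is reported in neither.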
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison using independent tests.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains original single-end transcriptome outputs from the forward broken paired-end RNAseq of the false black coral Savalia savaglia (NCBI/BioProject accession: PRJNA1111802). The dataset includes the following files:
• Assembly_Ss_SE_Trinity.fastq.U.qtrim: Remaining unpaired reads from the forward broken paired-end RNAseq of Savalia savaglia.
• Assembly_Ss_SE.Trinity.fasta: Single-end transcriptome obtained with the Trinity Assembler from the forward broken paired-end RNAseq of Savalia savaglia.
• Assembly_Ss_SE_Trinity.fasta_stats.txt: Statistics summary of the single-end transcriptome of Savalia savaglia.
• Assembly_Ss_SE_Trinity.fasta.gene_trans_map: Transcript-to-gene mapping file generated during the assembly process.
• quant.sf: Salmon output containing expression values for the assembled transcripts.
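As an illustration of how the Salmon output listed above might be inspected, the sketch below ranks transcripts by TPM. It assumes the usual tab-separated quant.sf layout with a header row containing `Name` and `TPM` columns; the function name is hypothetical.

```python
import csv
import io


def top_expressed(quant_text, n=5):
    """Return the n transcripts with the highest TPM as (name, tpm) pairs."""
    rows = list(csv.DictReader(io.StringIO(quant_text), delimiter="\t"))
    # Highest expression first.
    rows.sort(key=lambda row: float(row["TPM"]), reverse=True)
    return [(row["Name"], float(row["TPM"])) for row in rows[:n]]
```

In practice the text would come from reading the file, for example `top_expressed(open("quant.sf").read(), n=10)`.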
Additionally, the dataset includes the following files from DIAMOND BLASTx analyses:
• UniRef90_SE.diamond.blastx.outfmt6: BLASTx output file against the UniRef90 database, reporting the top alignment for each query.
• UniRef90_SE.diamond.blastx.outfmt6.grouped: BLASTx hits grouped to improve sequence coverage by combining multiple high-scoring segment pairs (HSPs).
• UniRef90_SE.diamond.blastx.outfmt6.hist: Histogram summarizing the distribution of BLASTx hit lengths.
• UniRef90_SE.diamond.blastx.outfmt6.w_pct_hit_length: File providing percentages of hit lengths, including top hit's length and percent of the length covered in the alignment.
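A tabular BLASTx result such as the .outfmt6 file above can be reduced to one best hit per query with a short script. The sketch assumes the standard 12-column tabular layout (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) and uses the highest bit score as the notion of "best"; the function name is hypothetical.

```python
import csv
import io

OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(outfmt6_text):
    """Map each query ID to its row with the highest bit score."""
    reader = csv.DictReader(
        io.StringIO(outfmt6_text), fieldnames=OUTFMT6_COLUMNS, delimiter="\t"
    )
    best = {}
    for row in reader:
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best
```

If the file already reports only the top alignment per query, as described above, the function simply returns one row per query unchanged.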
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains outputs generated from the original paired-end transcriptomic analyses of the false black coral Savalia savaglia. The dataset includes the following files:
• 75282_ID2093_3-SAS_S416_L004_R1_001.fastq.P.qtrim.zip: Preprocessed forward reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.
• 75282_ID2093_3-SAS_S416_L004_R2_001.fastq.P.qtrim.zip: Preprocessed reverse reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.
• Assembly_Ss_PE.Trinity.fasta: Original, non-filtered de novo paired-end transcriptome assembly of Savalia savaglia, generated from 68 million PE reads using the Trinity Assembler.
• Assembly_Ss_PE_Trinity.fasta_stats.txt: Statistical summary of the paired-end transcriptome assembly of Savalia savaglia.
• Assembly_Ss_PE_Trinity.fasta.gene_trans_map: Transcript-to-gene mapping file generated during the paired-end transcriptome assembly of Savalia savaglia.
• quant_Ss_PE.sf: Salmon output containing expression values for the assembled transcripts from the paired-end assembly of Savalia savaglia.
Additionally, the dataset includes the following files from DIAMOND BLASTx analyses, which used the original de novo paired-end transcriptome assembly of the false black coral Savalia savaglia:
• UniRef90_PE.diamond.blastx.outfmt6: BLASTx output file against the UniRef90 database, reporting the top alignment for each query (assembled transcripts) from the paired-end assembly of Savalia savaglia.
• UniRef90_PE.diamond.blastx.outfmt6.grouped: Grouped BLASTx hits from the paired-end assembly of Savalia savaglia, designed to improve sequence coverage by combining multiple high-scoring segment pairs (HSPs).
• UniRef90_PE.diamond.blastx.outfmt6.hist: Histogram summarizing the distribution of BLASTx hit lengths obtained from the paired-end assembly of Savalia savaglia.
• UniRef90_PE.diamond.blastx.outfmt6.w_pct_hit_length: File providing percentages of hit lengths from BLASTx analyses of the paired-end assembly of Savalia savaglia, including the top hit's length and the percent of the length covered in the alignment.
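The transcript-to-gene mapping file listed above lends itself to a quick summary of isoform counts per gene. The sketch assumes the two-column, tab-separated layout that Trinity typically writes (gene ID first, transcript ID second); that column order and the function name are assumptions to check against the actual file.

```python
from collections import Counter


def transcripts_per_gene(map_text):
    """Count how many transcripts are mapped to each gene ID."""
    counts = Counter()
    for line in map_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene_id, _transcript_id = line.split("\t")
        counts[gene_id] += 1
    return counts
```

Genes with a count above one are those for which the assembler reported multiple isoforms.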
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit report of Uniref Sociedad Anonima Cerrada contains unique and detailed export-import market intelligence, with its phone, email, and LinkedIn, and details of each import and export shipment such as product, quantity, price, buyer, supplier names, country, and date of shipment.