agemagician/uniref90 dataset hosted on Hugging Face and contributed by the HF Datasets community
zpn/uniref90 dataset hosted on Hugging Face and contributed by the HF Datasets community
Databases that provide clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (the UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
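The seed-based grouping described above can be illustrated with a small sketch. This is a toy, not the UniRef pipeline: `toy_identity` is a hypothetical ungapped, left-anchored comparison standing in for a real alignment-based identity, and `greedy_cluster` and the example IDs are names invented for illustration. Only the 11-residue minimum, the longest-sequence seed, and the 90%/50% thresholds come from the description.

```python
from typing import Dict, List


def toy_identity(seed: str, other: str) -> float:
    """Fraction of positions in `other` matching `seed` in an ungapped,
    left-anchored comparison (an assumption; real identity needs an alignment)."""
    if not other:
        return 0.0
    matches = sum(1 for a, b in zip(seed, other) if a == b)
    return matches / len(other)


def greedy_cluster(sequences: Dict[str, str], threshold: float,
                   min_len: int = 11) -> Dict[str, List[str]]:
    """Map each seed ID to the IDs of its cluster members (seed included)."""
    # Longest sequences first, so each seed is the longest member of its cluster.
    ordered = sorted(
        (item for item in sequences.items() if len(item[1]) >= min_len),
        key=lambda item: (-len(item[1]), item[0]),
    )
    clusters: Dict[str, List[str]] = {}
    for seq_id, seq in ordered:
        for seed_id in clusters:
            if toy_identity(sequences[seed_id], seq) >= threshold:
                clusters[seed_id].append(seq_id)
                break
        else:
            # No existing seed is similar enough: this sequence seeds a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters


if __name__ == "__main__":
    toy = {
        "P1": "MKTAYIAKQRQISFVKSHFS",
        "P2": "MKTAYIAKQRQISFVKSHFA",  # one mismatch against P1
        "P3": "MKTAYIAKQRGGGGGKSHFS",  # five mismatches against P1
        "P4": "MKTAY",                 # shorter than 11 residues, skipped
    }
    print(greedy_cluster(toy, threshold=0.9))
    print(greedy_cluster(toy, threshold=0.5))
```

Running the same input at both thresholds shows the "several resolutions" idea: the stricter threshold keeps P3 in its own cluster, while the looser one folds it into the P1 cluster.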
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Stephen Fan
Released under CC0: Public Domain
20231104
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) Bac: cluster is mixed with bacterial proteins. (b) Length of the cluster's seed protein. (c) Analysis is based on a phylogenetic tree and on analyzing the expanded cluster according to UniRef50. (d) H2V: from host to virus, i.e., sequences acquired by the virus from a metazoan host. N.D., unresolved; Cont, contamination; Frag, fragment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the packaged CAT/BAT database, which stores all amino acid sequences from UniRef90 as well as ~440,000 algal chloroplast sequences from the NCBI nucleotide database. Before running ChloroScan, download this package, unzip it, and pass the tax/ and db/ directories inside it as parameters.
Various non-redundant databases with different sequence identity cut-offs created by clustering closely similar sequences to yield a representative subset of sequences. In the UniRef90 and UniRef50 databases no pair of sequences in the representative set has >90% or >50% mutual sequence identity. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases. The two major objectives of UniRef are: (i) to facilitate sequence merging in UniProt, and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains a certain level of redundancy because it is not possible to use fully automatic merging without risking merging of similar sequences from different proteins. However, such automatic procedures are extremely useful in compiling the UniRef databases to obtain complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy results in several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. Thus, a more even sampling of sequence space is advantageous. You may access NREF via the FTP server.
This dataset was created by Darien Schettler
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ESM-2 Uniref Pretraining Data
Dataset Description:
UniRef, or UniProt Reference Clusters, are databases of clustered protein sequences from the UniProt Knowledgebase (UniProtKB) that group similar sequences to reduce redundancy and make data easier to work with for biological research. It offers different levels of clustering (UniRef100, UniRef90, and UniRef50) based on sequence identity, with each cluster containing a representative sequence, a count of member proteins… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/esm2_uniref_pretraining_data.
This is a dataset downloaded from the UniRef90 database, with sequence lengths ranging from 0 to 50.
Code for the data mining (downloaded on September 30, 2024):
import requests
query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29+AND+%28identity%3A0.9%29'
uniprot_request = requests.get(query_url)
from io import BytesIO
import pandas
bio = BytesIO(uniprot_request.content)
df =… See the full description on the dataset page: https://huggingface.co/datasets/dzjxzyd/UniRef90_len_0_50.
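The snippet above is cut off at `df =`, and its continuation is not reproduced here. As a separate, hedged sketch (not the dataset author's code), the following shows how the same query URL could be assembled from its parts and how a gzip-compressed TSV response body could be parsed using only the standard library. The function names `build_query_url` and `parse_compressed_tsv` are hypothetical, and the column names in the usage example are placeholders rather than the headers the service actually returns.

```python
import csv
import gzip
import io
from typing import Dict, List
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/uniref/stream"


def build_query_url(max_length: int = 50, identity: str = "0.9") -> str:
    """Assemble the stream URL from readable, unescaped parameters."""
    params = {
        "compressed": "true",
        "fields": "id,length,identity,sequence",
        "format": "tsv",
        "query": f"((length:[* TO {max_length}])) AND (identity:{identity})",
    }
    # safe="*" leaves the wildcard unescaped, matching the URL quoted above.
    return f"{BASE_URL}?{urlencode(params, safe='*')}"


def parse_compressed_tsv(payload: bytes) -> List[Dict[str, str]]:
    """Decompress a gzip payload and read it as tab-separated rows."""
    text = gzip.decompress(payload).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


if __name__ == "__main__":
    print(build_query_url())
    # Synthetic payload with placeholder headers; no network access is needed.
    fake = gzip.compress(b"id\tlength\nX1\t12\n")
    print(parse_compressed_tsv(fake))
```

Building the URL from a dictionary keeps the query readable and makes it easy to change the length bound or identity level without hand-editing percent-escapes.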
The TIGR database is a collection of plant transcript sequences. Transcript assemblies are searchable using BLAST and accession number. The construction of plant transcript assemblies (TAs) is similar to the TIGR gene indices. The sequences that are used to build the plant TAs are expressed transcripts collected from dbEST (ESTs) and the NCBI GenBank nucleotide database (full-length and partial cDNAs). "Virtual" transcript sequences derived from whole genome annotation projects are not included. All plant species for which more than 1,000 ESTs or cDNA sequences are available are included in this project. TAs are clustered and assembled using the TGICL tool (Pertea et al., 2003), Megablast (Zhang et al., 2000) and the CAP3 assembler (Huang and Madan, 1999). TGICL is a wrapper script which invokes Megablast and CAP3. Sequences are initially clustered based on all-against-all comparisons using Megablast. The initial clusters are assembled to generate consensus sequences using CAP3. Assembly criteria include a 50 bp minimum match, 95% minimum identity in the overlap region and 20 bp maximum unmatched overhangs. Any EST/cDNA sequences that are not assembled into TAs are included as singletons. All singletons retain their GenBank accession numbers as identifiers. Plant TA identifiers are of the form TAnumber_taxonID, where number is a unique numerical identifier of the transcript assembly and taxonID represents the NCBI taxon ID. In order to provide annotation for the TAs, each TA/singleton was aligned to the UniProt UniRef database. For release 1 TAs, a masked version of the UniRef90 database was used. For release 2 and onwards, a masked version of the UniRef100 database is used. Alignments were required to have at least 20% identity and 20% coverage. The annotation for the protein with the best alignment to each TA or singleton was used as the annotation for that sequence.
Additionally, the relative orientation of each TA/singleton to the best matching protein sequence was used to determine the orientation of each TA/singleton. Some sequences did not have alignments to the protein database that met our quality criteria, and those sequences have neither annotation nor orientation assignments. The release number for the plant TAs refers to the release version for a particular species. For the initial build, all TA sets are of version 1. Subsequent TA updates for new releases will be carried out when the percentage increase of the EST and cDNA counts exceeds 10% of the previous release and when the increase contains more than 1,000 new sequences. New releases will also include additional plant species with more than 1,000 EST or cDNA sequences that have become publicly available.
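Three of the rules in this description are concrete enough to sketch: the TAnumber_taxonID identifier form, the 20% identity and 20% coverage filter with best-hit annotation, and the release trigger (an increase above 10% of the previous count that also exceeds 1,000 sequences). The code below is an illustration under assumptions, not the TIGR pipeline: the `Hit` record, the function names, and the reading of "best alignment" as "highest score" are all hypothetical, and the example identifier is made up.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Hit(NamedTuple):
    protein_id: str
    identity_pct: float
    coverage_pct: float
    score: float  # assumption: the "best" alignment is the highest-scoring one


def parse_ta_identifier(identifier: str) -> Tuple[int, int]:
    """Split 'TA<number>_<taxonID>' into (number, taxon_id)."""
    prefix, _, taxon = identifier.partition("_")
    if not (prefix.startswith("TA") and prefix[2:].isdigit() and taxon.isdigit()):
        raise ValueError(f"not a TA identifier: {identifier!r}")
    return int(prefix[2:]), int(taxon)


def best_annotation(hits: Sequence[Hit], min_identity: float = 20.0,
                    min_coverage: float = 20.0) -> Optional[Hit]:
    """Return the best hit passing both cut-offs, or None if nothing passes."""
    passing = [h for h in hits
               if h.identity_pct >= min_identity and h.coverage_pct >= min_coverage]
    return max(passing, key=lambda h: h.score, default=None)


def needs_new_release(previous_count: int, current_count: int) -> bool:
    """Both conditions must hold: more than 10% growth and more than 1,000 new sequences."""
    added = current_count - previous_count
    return added > 1000 and added > 0.10 * previous_count


if __name__ == "__main__":
    print(parse_ta_identifier("TA123_3702"))  # made-up identifier
    print(needs_new_release(20000, 22500))
    print(needs_new_release(20000, 21500))
```

Returning `None` from `best_annotation` mirrors the statement that sequences without a qualifying alignment receive neither annotation nor orientation.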
This dataset was created by team93
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison using independent tests.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of interface and non-interface residues in RB198, RB44, and RB111 datasets.
Note: Unless indicated by *, all homologous sequences have been gathered from the UniRef50 database. If very few homologues are identified, then homologues identified from the UniRef90 database (indicated by *) are used in the analysis. In a few PDB entries, several molecules are present. The dimeric molecule under consideration is highlighted using italics.
ESM2nv UniRef Training Data (Streaming)
This dataset provides streaming access to UniRef sequences used for NVIDIA ESM2nv pretraining.
Contents
name=default (split train): UniRef90 representatives + members corresponding to UniRef50 train reps.
name=validation (split validation): UniRef50 validation representatives.
name=train_reps (split train, optional): UniRef50 training representatives only.
Features
text (str): amino-acid sequence
id (str): header… See the full description on the dataset page: https://huggingface.co/datasets/frdddy/ESM2nv_Uniref_Training_Data_hf.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07. Metaclusters represent putative protein families automatically derived using the DPCfam method, as described in Unsupervised protein family classification by Density Peak clustering, Russo ET, 2020, PhD Thesis http://hdl.handle.net/20.500.11767/116345 . Supervisors: Alessandro Laio, Marco Punta.
Visit also https://dpcfam.areasciencepark.it/ to easily navigate the data.
VERSION 1.1 changes:
Added DPCfamB database, including all small metaclusters with 25<=N<50 seed sequences. DPCfamB files are named with the prefix B_
Added AlphaFold representative based on AlphaFoldDB for each MC
FILES DESCRIPTION:
1) Standard DPCfam database
metaclusters_xml.tar.gz: Metaclusters' seeds, unaligned, in an XML table. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included.
metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported.
metaclusters_hmms.tar.gz: Metaclusters' profile-HMMs. A ".hmm" file for each metacluster. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported.
all_metaclusters_hmm.tar.gz: Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported.
uniref50_annotated.xml.gz: UniRef50 v. 2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included; it is derived from UniProt's UniRef50 XML schema.
2) DPCfamB database
B_metaclusters_xml.tar.gz: Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included.
B_metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a. are reported.
B_metaclusters_hmms.tar.gz: Metaclusters' profile-HMMs. A ".hmm" file for each metacluster. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a. are reported.
B_all_metaclusters_hmm.tar.gz: Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a. are reported.
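The reporting criteria for the MSA and HMM archives can be written down as a small rule, which also makes one edge case visible. The sketch below is an illustration only: `dpcfam_release` is a hypothetical helper, not part of the DPCfam distribution, and it encodes the two criteria exactly as worded above (more than 50 seed elements for the standard database, 25<=N<50 for DPCfamB, average length larger than 50 a.a. for both).

```python
from typing import Optional


def dpcfam_release(n_seeds: int, avg_length: float) -> Optional[str]:
    """Return 'DPCfam', 'DPCfamB', or None for a metacluster,
    following the MSA/HMM reporting criteria as literally stated."""
    if avg_length <= 50:
        return None  # both databases require average length larger than 50 a.a.
    if n_seeds > 50:
        return "DPCfam"
    if 25 <= n_seeds < 50:
        return "DPCfamB"
    # Fewer than 25 seeds, or exactly 50: neither stated rule covers these.
    return None


if __name__ == "__main__":
    for n in (24, 25, 49, 50, 51):
        print(n, dpcfam_release(n, avg_length=80.0))
```

Read literally, a metacluster with exactly 50 seed elements matches neither rule; whether the released files treat that case as standard or B is not stated in the description.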
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains outputs generated from the original paired-end transcriptomic analyses of the false black coral Savalia savaglia. The dataset includes the following files:
• 75282_ID2093_3-SAS_S416_L004_R1_001.fastq.P.qtrim.zip: Preprocessed forward reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.
• 75282_ID2093_3-SAS_S416_L004_R2_001.fastq.P.qtrim.zip: Preprocessed reverse reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.
• Assembly_Ss_PE.Trinity.fasta: Original, non-filtered de novo paired-end transcriptome assembly of Savalia savaglia, generated from 68 million PE reads using the Trinity Assembler.
• Assembly_Ss_PE_Trinity.fasta_stats.txt: Statistical summary of the paired-end transcriptome assembly of Savalia savaglia.
• Assembly_Ss_PE_Trinity.fasta.gene_trans_map: Transcript-to-gene mapping file generated during the paired-end transcriptome assembly of Savalia savaglia.
• quant_Ss_PE.sf: Salmon output containing expression values for the assembled transcripts from the paired-end assembly of Savalia savaglia.
Additionally, the dataset includes the following files from DIAMOND BLASTx analyses, which used the original de novo paired-end transcriptome assembly of the false black coral Savalia savaglia:
• UniRef90_PE.diamond.blastx.outfmt6: BLASTx output file against the UniRef90 database, reporting the top alignment for each query (assembled transcripts) from the paired-end assembly of Savalia savaglia.
• UniRef90_PE.diamond.blastx.outfmt6.grouped: Grouped BLASTx hits from the paired-end assembly of Savalia savaglia, designed to improve sequence coverage by combining multiple high-scoring segment pairs (HSPs).
• UniRef90_PE.diamond.blastx.outfmt6.hist: Histogram summarizing the distribution of BLASTx hit lengths obtained from the paired-end assembly of Savalia savaglia.
• UniRef90_PE.diamond.blastx.outfmt6.w_pct_hit_length: File providing percentages of hit lengths from BLASTx analyses of the paired-end assembly of Savalia savaglia, including the top hit's length and the percent of the length covered in the alignment.
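For readers who want to work with the tabular BLASTx output, here is a minimal sketch of selecting the top hit per query and computing the percentage of a hit's length covered by the alignment. It assumes the file uses the standard 12-column BLAST tabular layout (qseqid through bitscore), which may not match the columns actually present in these files. The function names, the example rows, and the subject length passed in the usage example are hypothetical; subject lengths are not part of the 12 standard columns and would have to come from the database.

```python
import csv
from typing import Dict, Iterable

# Standard 12-column BLAST tabular layout (assumed for this sketch).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def top_hits(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Keep, for each query, the row with the highest bit score."""
    best: Dict[str, Dict[str, str]] = {}
    reader = csv.DictReader(lines, fieldnames=OUTFMT6_COLUMNS, delimiter="\t")
    for row in reader:
        query = row["qseqid"]
        if query not in best or float(row["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = row
    return best


def pct_hit_length(row: Dict[str, str], subject_length: int) -> float:
    """Percent of the subject (hit) sequence spanned by the alignment."""
    aligned = abs(int(row["send"]) - int(row["sstart"])) + 1
    return 100.0 * aligned / subject_length


if __name__ == "__main__":
    rows = [
        "t1\tUniRef90_A\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t150.0",
        "t1\tUniRef90_B\t90.0\t50\t5\t0\t1\t150\t11\t60\t1e-20\t200.0",
        "t2\tUniRef90_C\t75.0\t40\t10\t0\t4\t123\t1\t40\t1e-10\t90.0",
    ]
    best = top_hits(rows)
    print(best["t1"]["sseqid"], pct_hit_length(best["t1"], subject_length=200))
```

The same `top_hits` function accepts an open file handle in place of the list, since it only needs an iterable of lines.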
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PSI-BLAST memory usage.