53 datasets found
  1. uniref90

    • huggingface.co
    Updated Apr 29, 2022
    Cite
    Ahmed Elnaggar (2022). uniref90 [Dataset]. https://huggingface.co/datasets/agemagician/uniref90
    Dataset updated
    Apr 29, 2022
    Authors
    Ahmed Elnaggar
    Description

    agemagician/uniref90 dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. uniref90

    • huggingface.co
    Updated Mar 14, 2023
    Cite
    Zach Nussbaum (2023). uniref90 [Dataset]. https://huggingface.co/datasets/zpn/uniref90
    Dataset updated
    Mar 14, 2023
    Authors
    Zach Nussbaum
    Description

    zpn/uniref90 dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. UniRef

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Nov 16, 2024
    Cite
    (2024). UniRef [Dataset]. http://identifiers.org/RRID:SCR_010646
    Dataset updated
    Nov 16, 2024
    Description

    Databases which provide clustered sets of sequences from UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
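The seed-based clustering described above can be illustrated with a toy sketch. It is not the UniRef pipeline: the identity measure here is a naive ungapped position-by-position comparison over the shorter sequence (an assumption made for brevity, whereas real clustering relies on proper alignment), and the function names are hypothetical. Only the 11-residue minimum, the identity thresholds, and the "longest sequence as seed" idea come from the description.

```python
from typing import Dict, List


def ungapped_identity(seq: str, seed: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(seq), len(seed))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq, seed) if a == b)
    return matches / n


def greedy_cluster(
    sequences: List[str], threshold: float, min_len: int = 11
) -> Dict[str, List[str]]:
    """Assign each sequence to the first seed it matches at >= threshold identity."""
    clusters: Dict[str, List[str]] = {}
    # Longest sequences are visited first, so each seed is the longest member
    # of its cluster.
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) < min_len:
            continue  # sequences shorter than 11 residues are not clustered
        for seed in clusters:
            if ungapped_identity(seq, seed) >= threshold:
                clusters[seed].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no seed was close enough: start a new cluster
    return clusters
```

Calling `greedy_cluster(seqs, 0.9)` and `greedy_cluster(seqs, 0.5)` on the same input shows how the looser threshold merges clusters that stay separate at 90%.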

  4. uniref90

    • kaggle.com
    zip
    Updated Nov 5, 2023
    Cite
    Stephen Fan (2023). uniref90 [Dataset]. https://www.kaggle.com/datasets/zhfanrui/uniref90
    Available download formats
    zip (42899039362 bytes)
    Dataset updated
    Nov 5, 2023
    Authors
    Stephen Fan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Stephen Fan

    Released under CC0: Public Domain

    Contents

    20231104

  5. List of UniRef90 clusters that include mammals and dsDNA viruses (Class I).

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Nadav Rappoport; Michal Linial (2023). List of UniRef90 clusters that include mammals and dsDNA viruses (Class I). [Dataset]. http://doi.org/10.1371/journal.pcbi.1002364.t001
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nadav Rappoport; Michal Linial
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    a) Bac: cluster is mixed with bacterial proteins.
    b) Length of the cluster's seed protein.
    c) Analysis is based on the phylogenetic tree and on analyzing the expanded cluster according to UniRef50.
    d) H2V: from host to virus, i.e., sequences acquired by the virus from a metazoan host.
    N.D.: unresolved; Cont: contamination; Frag: fragment.

  6. CAT/BAT uniref90+algae proteins from NCBI

    • figshare.unimelb.edu.au
    • figshare.com
    bin
    Updated Sep 25, 2025
    Cite
    Yuhao Tong (2025). CAT/BAT uniref90+algae proteins from NCBI [Dataset]. http://doi.org/10.26188/27990278.v2
    Dataset updated
    Sep 25, 2025
    Dataset provided by
    The University of Melbourne
    Authors
    Yuhao Tong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the packaged CAT/BAT database, containing all amino acid sequences from UniRef90 as well as ~440,000 algal chloroplast sequences from the NCBI nucleotide database. Before running ChloroScan, download this package, unzip it, and pass the tax/ and db/ directories inside it as parameters.

  7. UniRef at the EBI

    • dknet.org
    • scicrunch.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). UniRef at the EBI [Dataset]. http://identifiers.org/RRID:SCR_004972
    Dataset updated
    Jan 29, 2022
    Description

    Various non-redundant databases with different sequence identity cut-offs created by clustering closely similar sequences to yield a representative subset of sequences. In the UniRef90 and UniRef50 databases no pair of sequences in the representative set has >90% or >50% mutual sequence identity. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases. The two major objectives of UniRef are: (i) to facilitate sequence merging in UniProt, and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains a certain level of redundancy because it is not possible to use fully automatic merging without risking merging of similar sequences from different proteins. However, such automatic procedures are extremely useful in compiling the UniRef databases to obtain complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy results in several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. Thus, a more even sampling of sequence space is advantageous. You may access NREF via the FTP server.

  8. UniProt UniRef90

    • kaggle.com
    zip
    Updated Oct 4, 2022
    Cite
    Darien Schettler (2022). UniProt UniRef90 [Dataset]. https://www.kaggle.com/datasets/dschettler8845/uniprot-uniref90
    Available download formats
    zip (36601227955 bytes)
    Dataset updated
    Oct 4, 2022
    Authors
    Darien Schettler
    Description

    Dataset

    This dataset was created by Darien Schettler

    Contents

  9. esm2_uniref_pretraining_data

    • huggingface.co
    Updated Nov 24, 2025
    Cite
    NVIDIA (2025). esm2_uniref_pretraining_data [Dataset]. https://huggingface.co/datasets/nvidia/esm2_uniref_pretraining_data
    Dataset updated
    Nov 24, 2025
    Dataset provided by
    Nvidia (http://nvidia.com/)
    Authors
    NVIDIA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESM-2 Uniref Pretraining Data

      Dataset Description:
    

    UniRef, or UniProt Reference Clusters, are databases of clustered protein sequences from the UniProt Knowledgebase (UniProtKB) that group similar sequences to reduce redundancy and make data easier to work with for biological research. It offers different levels of clustering (UniRef100, UniRef90, and UniRef50) based on sequence identity, with each cluster containing a representative sequence, a count of member proteins… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/esm2_uniref_pretraining_data.

  10. UniRef90_len_0_50

    • huggingface.co
    Cite
    Zhenjiao Du, UniRef90_len_0_50 [Dataset]. https://huggingface.co/datasets/dzjxzyd/UniRef90_len_0_50
    Authors
    Zhenjiao Du
    Description

    This dataset was downloaded from the UniRef90 database and contains sequences with lengths ranging from 0 to 50.

    Code used for the data mining (downloaded on September 30, 2024):

        import requests
        query_url = 'https://rest.uniprot.org/uniref/stream?compressed=true&fields=id%2Clength%2Cidentity%2Csequence&format=tsv&query=%28%28length%3A%5B*+TO+50%5D%29%29+AND+%28identity%3A0.9%29'
        uniprot_request = requests.get(query_url)
        from io import BytesIO
        import pandas

        bio = BytesIO(uniprot_request.content)

        df =…

    See the full description on the dataset page: https://huggingface.co/datasets/dzjxzyd/UniRef90_len_0_50.
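The percent-encoded query URL quoted in the description is hard to read, so the sketch below rebuilds it from plain parameters and shows one way to parse a gzip-compressed TSV response. It is not the dataset author's code, whose remainder is truncated on this page. The function names are hypothetical, and the parser uses only the standard library rather than pandas; fetching the URL is left out so the example runs offline.

```python
import csv
import gzip
import io
from typing import Dict, List
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/uniref/stream"


def build_query_url(max_length: int = 50, identity: str = "0.9") -> str:
    """Rebuild the query URL quoted in the description from readable parts."""
    params = {
        "compressed": "true",
        "fields": "id,length,identity,sequence",
        "format": "tsv",
        "query": f"((length:[* TO {max_length}])) AND (identity:{identity})",
    }
    # safe="*" keeps the wildcard unescaped, matching the quoted URL.
    return f"{BASE_URL}?{urlencode(params, safe='*')}"


def parse_gzipped_tsv(payload: bytes) -> List[Dict[str, str]]:
    """Decompress a gzip-compressed TSV payload and return one dict per row."""
    text = gzip.decompress(payload).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

With the defaults, `build_query_url()` reproduces the URL from the description character for character; the bytes of a response to that URL could then be passed to `parse_gzipped_tsv`.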

  11. TIGR Plant Transcript Assembly database

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Nov 28, 2024
    Cite
    (2024). TIGR Plant Transcript Assembly database [Dataset]. http://identifiers.org/RRID:SCR_005470
    Dataset updated
    Nov 28, 2024
    Description

    The TIGR database is a collection of plant transcript sequences. Transcript assemblies are searchable using BLAST and accession number. The construction of plant transcript assemblies (TAs) is similar to the TIGR gene indices. The sequences that are used to build the plant TAs are expressed transcripts collected from dbEST (ESTs) and the NCBI GenBank nucleotide database (full length and partial cDNAs). "Virtual" transcript sequences derived from whole genome annotation projects are not included. All plant species for which more than 1,000 ESTs or cDNA sequences are available are included in this project. TAs are clustered and assembled using the TGICL tool (Pertea et al., 2003), Megablast (Zhang et al., 2000) and the CAP3 assembler (Huang and Madan, 1999). TGICL is a wrapper script which invokes Megablast and CAP3. Sequences are initially clustered based on all-against-all comparisons using Megablast. The initial clusters are assembled to generate consensus sequences using CAP3. Assembly criteria include a 50 bp minimum match, 95% minimum identity in the overlap region and 20 bp maximum unmatched overhangs. Any EST/cDNA sequences that are not assembled into TAs are included as singletons. All singletons retain their GenBank accession numbers as identifiers. Plant TA identifiers are of the form TAnumber_taxonID, where number is a unique numerical identifier of the transcript assembly and taxonID represents the NCBI taxon id. In order to provide annotation for the TAs, each TA/singleton was aligned to the UniProt UniRef database. For release 1 TAs, a masked version of the UniRef90 database was used. For release 2 and onwards, a masked version of the UniRef100 database is used. Alignments were required to have at least 20% identity and 20% coverage. The annotation for the protein with the best alignment to each TA or singleton was used as the annotation for that sequence.
Additionally, the relative orientation of each TA/singleton to the best matching protein sequence was used to determine the orientation of each TA/singleton. Some sequences did not have alignments to the protein database that met our quality criteria, and those sequences have neither annotation nor orientation assignments. The release number for the plant TAs refers to the release version for a particular species. For the initial build, all TA sets are of version 1. Subsequent TA updates for new releases will be carried out when the percentage increase of the EST and cDNA counts exceeds 10% of the previous release and when the increase contains more than 1,000 new sequences. New releases will also include additional plant species with more than 1,000 EST or cDNA sequences that have become publicly available.
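The identifier format, the alignment cut-offs, and the release rule in this description can be written down as small checks. The sketch below is illustrative only: the function names are hypothetical, the regular expression assumes identifiers look literally like `TA<digits>_<digits>`, and the identifier used in the usage note is made up.

```python
import re
from typing import NamedTuple

# Assumed literal shape of "TAnumber_taxonID": TA, digits, underscore, digits.
TA_PATTERN = re.compile(r"^TA(\d+)_(\d+)$")


class TranscriptAssemblyId(NamedTuple):
    number: int    # unique numerical identifier of the transcript assembly
    taxon_id: int  # NCBI taxon id


def parse_ta_identifier(identifier: str) -> TranscriptAssemblyId:
    """Split a TA identifier into its assembly number and taxon id."""
    match = TA_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a TA identifier: {identifier!r}")
    return TranscriptAssemblyId(int(match.group(1)), int(match.group(2)))


def passes_annotation_filter(percent_identity: float, percent_coverage: float) -> bool:
    """Alignments need at least 20% identity and 20% coverage."""
    return percent_identity >= 20.0 and percent_coverage >= 20.0


def needs_new_release(previous_count: int, current_count: int) -> bool:
    """A new release needs >10% growth and more than 1,000 new sequences."""
    added = current_count - previous_count
    return added > 1000 and added > 0.10 * previous_count
```

For example, `parse_ta_identifier("TA123_3702")` yields assembly number 123 and taxon id 3702, and `needs_new_release(20000, 21500)` is false because 1,500 new sequences is less than 10% of 20,000.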

  12. uniref example

    • kaggle.com
    zip
    Updated Feb 23, 2025
    Cite
    team93 (2025). uniref example [Dataset]. https://www.kaggle.com/datasets/team93/uniref-example
    Available download formats
    zip (2805668 bytes)
    Dataset updated
    Feb 23, 2025
    Authors
    team93
    Description

    Dataset

    This dataset was created by team93

    Contents

  13. Performance comparison using independent tests.

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xls
    Updated May 31, 2023
    Cite
    Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). Performance comparison using independent tests. [Dataset]. http://doi.org/10.1371/journal.pone.0158445.t005
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison using independent tests.

  14. Number of interface and non-interface residues in RB198, RB44, and RB111...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). Number of interface and non-interface residues in RB198, RB44, and RB111 datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0158445.t001
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of interface and non-interface residues in RB198, RB44, and RB111 datasets.

  15. Evolutionary aspects of functionally relevant homodimers exhibiting global...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 22, 2012
    Cite
    Srikeerthana, Kuchi; Srinivasan, Narayanaswamy; Swapna, Lakshmipuram Seshadri (2012). Evolutionary aspects of functionally relevant homodimers exhibiting global asymmetry. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001143814
    Dataset updated
    May 22, 2012
    Authors
    Srikeerthana, Kuchi; Srinivasan, Narayanaswamy; Swapna, Lakshmipuram Seshadri
    Description

    Note: Unless indicated by * all homologous sequences have been gathered from Uniref50 database. If very few homologues are identified then homologues identified from Uniref90 database (indicated by *) are used in the analysis. In a few PDB entries, several molecules are present. The dimeric molecule under consideration is highlighted using italics.

  16. ESM2nv_Uniref_Training_Data_hf

    • huggingface.co
    Updated Sep 27, 2025
    Cite
    Frederick Hoffman (2025). ESM2nv_Uniref_Training_Data_hf [Dataset]. https://huggingface.co/datasets/frdddy/ESM2nv_Uniref_Training_Data_hf
    Dataset updated
    Sep 27, 2025
    Authors
    Frederick Hoffman
    Description

    ESM2nv UniRef Training Data (Streaming)

    This dataset provides streaming access to UniRef sequences used for NVIDIA ESM2nv pretraining.

      Contents
    

    • name=default (split train): UniRef90 representatives + members corresponding to UniRef50 train reps.
    • name=validation (split validation): UniRef50 validation representatives.
    • name=train_reps (split train, optional): UniRef50 training representatives only.

      Features
    

    • text (str): amino-acid sequence
    • id (str): header…

    See the full description on the dataset page: https://huggingface.co/datasets/frdddy/ESM2nv_Uniref_Training_Data_hf.
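As a rough illustration of the streaming access mentioned above, the sketch below maps each configuration name to its split and wraps the Hugging Face `datasets.load_dataset` call with `streaming=True`. The helper name and the mapping constant are hypothetical, the `train_reps` configuration is described as optional and may be absent, and running the generator assumes the `datasets` package is installed and the repository is reachable.

```python
from typing import Dict, Iterator

REPO_ID = "frdddy/ESM2nv_Uniref_Training_Data_hf"

# Configuration name -> split, as listed in the dataset description.
CONFIG_SPLITS: Dict[str, str] = {
    "default": "train",
    "validation": "validation",
    "train_reps": "train",  # described as optional; may not exist
}


def stream_records(config: str = "default") -> Iterator[dict]:
    """Yield records lazily; requires the `datasets` package and network access."""
    # Imported here so this module can be loaded without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        REPO_ID,
        name=config,
        split=CONFIG_SPLITS[config],
        streaming=True,  # iterate without downloading the whole dataset first
    )
    yield from dataset
```

Iterating over `stream_records("validation")` would then yield dictionaries that, per the description, carry a `text` field (the amino-acid sequence) and an `id` field (the header).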

  17. Metaclusters by DPCfam clustering of UniRef50 v 2017_07

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Oct 30, 2022
    Cite
    Elena Tea Russo; Federico Barone (2022). Metaclusters by DPCfam clustering of UniRef50 v 2017_07 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5877585
    Dataset updated
    Oct 30, 2022
    Dataset provided by
    Sissa, Trieste (IT); Area Science Park, Trieste (IT)
    Area Science Park, Trieste (IT)
    Authors
    Elena Tea Russo; Federico Barone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07. Metaclusters represent putative protein families automatically derived using the DPCfam method, as described in Unsupervised protein family classification by Density Peak clustering, Russo ET, 2020, PhD Thesis http://hdl.handle.net/20.500.11767/116345 . Supervisors: Alessandro Laio, Marco Punta.

    Visit also https://dpcfam.areasciencepark.it/ to easily navigate the data.

    VERSION 1.1 changes:

    Added DPCfamB database, including all small metaclusters with 25<=N<50 seed sequences. DPCfamB files are named with the prefix B_

    Added Alphafold representative based on AlphaFoldDB for each MC

    FILES DESCRIPTION:

    1) Standard DPCfam database

    metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included.

    metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported.

    metaclusters_hmms.tar.gz Metaclusters' profile-HMMs. A ".hmm" file for each metacluster. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported.

    all_metaclusters_hmm.tar.gz Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs with seeds with 1) more than 50 elements and 2) average length larger than 50 a.a. are reported.

    uniref50_annotated.xml.gz UniRef50 v.2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included. XML schema is derived from uniprot's UniRef50 xml schema.

    2) DPCfamB database

    B_metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file is included describing the data. A parser is included to transform XML data to space-separated tables. XML schema is included.

    B_metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a. are reported.

    B_metaclusters_hmms.tar.gz Metaclusters' profile-HMMs. A ".hmm" file for each metacluster. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a. are reported.

    B_ all_metaclusters_hmm.tar.gz Collective metaclusters' profile-HMM. A single .hmm file collecting all MCs' profile-HMMs. Only MCs with seeds with 1) 25<=N<50 elements and 2) average length larger than 50 a.a. are reported.

  18. Original paired-end transcriptome outputs from the RNAseq analyses of the...

    • data.mendeley.com
    Updated Oct 10, 2024
    Cite
    Dany Domínguez Pérez (2024). Original paired-end transcriptome outputs from the RNAseq analyses of the false black coral Savalia savaglia [Dataset]. http://doi.org/10.17632/7t36p2dvjp.2
    Dataset updated
    Oct 10, 2024
    Authors
    Dany Domínguez Pérez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains outputs generated from the original paired-end transcriptomic analyses of the false black coral Savalia savaglia. The dataset includes the following files:

    • 75282_ID2093_3-SAS_S416_L004_R1_001.fastq.P.qtrim.zip: Preprocessed forward reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.

    • 75282_ID2093_3-SAS_S416_L004_R2_001.fastq.P.qtrim.zip: Preprocessed reverse reads used for the de novo assembly, obtained from the paired-end RNA-Seq data of Savalia savaglia, containing high-quality sequences that passed quality trimming and filtering using Trimmomatic.

    • Assembly_Ss_PE.Trinity.fasta: Original, non-filtered de novo paired-end transcriptome assembly of Savalia savaglia, generated from 68 million PE reads using the Trinity Assembler.

    • Assembly_Ss_PE_Trinity.fasta_stats.txt: Statistical summary of the paired-end transcriptome assembly of Savalia savaglia.

    • Assembly_Ss_PE_Trinity.fasta.gene_trans_map: Transcript-to-gene mapping file generated during the paired-end transcriptome assembly of Savalia savaglia.

    • quant_Ss_PE.sf: Salmon output containing expression values for the assembled transcripts from the paired-end assembly of Savalia savaglia.

    Additionally, the dataset includes the following files from DIAMOND BLASTx analyses, which used the original de novo paired-end transcriptome assembly of the false black coral Savalia savaglia:

    • UniRef90_PE.diamond.blastx.outfmt6: BLASTx output file against the UniRef90 database, reporting the top alignment for each query (assembled transcripts) from the paired-end assembly of Savalia savaglia.

    • UniRef90_PE.diamond.blastx.outfmt6.grouped: Grouped BLASTx hits from the paired-end assembly of Savalia savaglia, designed to improve sequence coverage by combining multiple high-scoring segment pairs (HSPs).

    • UniRef90_PE.diamond.blastx.outfmt6.hist: Histogram summarizing the distribution of BLASTx hit lengths obtained from the paired-end assembly of Savalia savaglia.

    • UniRef90_PE.diamond.blastx.outfmt6.w_pct_hit_length: File providing percentages of hit lengths from BLASTx analyses of the paired-end assembly of Savalia savaglia, including the top hit's length and the percent of the length covered in the alignment.
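A file such as UniRef90_PE.diamond.blastx.outfmt6 can be read with a few lines of Python. The sketch below assumes the conventional 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the description does not state the columns, so check the actual files before relying on it. The type and function names are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str]) -> Dict[str, Hit]:
    """Keep the highest-bitscore hit per query from 12-column tabular output."""
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split("\t")
        if len(cols) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(cols)}")
        hit = Hit(
            query=cols[0],
            subject=cols[1],
            percent_identity=float(cols[2]),
            alignment_length=int(cols[3]),
            evalue=float(cols[10]),
            bitscore=float(cols[11]),
        )
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

Passing an open file handle, for example `best_hits(open(path))`, returns one `Hit` per assembled transcript, keyed by the query name.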

  19. TIGR Plant Transcript Assembly database

    • scicrunch.org
    Updated Jun 20, 2024
    Cite
    (2024). TIGR Plant Transcript Assembly database [Dataset]. http://identifiers.org/RRID:SCR_005470
    Dataset updated
    Jun 20, 2024
    Description

    The TIGR database is a collection of plant transcript sequences. Transcript assemblies are searchable using BLAST and accession number. The construction of plant transcript assemblies (TAs) is similar to the TIGR gene indices. The sequences that are used to build the plant TAs are expressed transcripts collected from dbEST (ESTs) and the NCBI GenBank nucleotide database (full length and partial cDNAs). "Virtual" transcript sequences derived from whole genome annotation projects are not included. All plant species for which more than 1,000 ESTs or cDNA sequences are available are included in this project. TAs are clustered and assembled using the TGICL tool (Pertea et al., 2003), Megablast (Zhang et al., 2000) and the CAP3 assembler (Huang and Madan, 1999). TGICL is a wrapper script which invokes Megablast and CAP3. Sequences are initially clustered based on all-against-all comparisons using Megablast. The initial clusters are assembled to generate consensus sequences using CAP3. Assembly criteria include a 50 bp minimum match, 95% minimum identity in the overlap region and 20 bp maximum unmatched overhangs. Any EST/cDNA sequences that are not assembled into TAs are included as singletons. All singletons retain their GenBank accession numbers as identifiers. Plant TA identifiers are of the form TAnumber_taxonID, where number is a unique numerical identifier of the transcript assembly and taxonID represents the NCBI taxon id. In order to provide annotation for the TAs, each TA/singleton was aligned to the UniProt UniRef database. For release 1 TAs, a masked version of the UniRef90 database was used. For release 2 and onwards, a masked version of the UniRef100 database is used. Alignments were required to have at least 20% identity and 20% coverage. The annotation for the protein with the best alignment to each TA or singleton was used as the annotation for that sequence.
Additionally, the relative orientation of each TA/singleton to the best matching protein sequence was used to determine the orientation of each TA/singleton. Some sequences did not have alignments to the protein database that met our quality criteria, and those sequences have neither annotation nor orientation assignments. The release number for the plant TAs refers to the release version for a particular species. For the initial build, all TA sets are of version 1. Subsequent TA updates for new releases will be carried out when the percentage increase of the EST and cDNA counts exceeds 10% of the previous release and when the increase contains more than 1,000 new sequences. New releases will also include additional plant species with more than 1,000 EST or cDNA sequences that have become publicly available.

  20. PSI-BLAST memory usage.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar (2023). PSI-BLAST memory usage. [Dataset]. http://doi.org/10.1371/journal.pone.0158445.t006
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yasser EL-Manzalawy; Mostafa Abbas; Qutaibah Malluhi; Vasant Honavar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PSI-BLAST memory usage.
