60 datasets found
  1. Protein Structural Domain Classification

    • cathdb.info
    • ec.i4cologne.com
    • +3 more
    Updated Sep 30, 2024
    Cite
    (2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
    Dataset updated
    Sep 30, 2024
    Description

    CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.

  2. Alternative Splicing Annotation Project II Database

    • dknet.org
    • scicrunch.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis along with the corresponding splicing variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and the lists of tissue and cancer (normal) specific alternatively spliced genes were recalculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignment among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than methods based on protein similarity. Furthermore, splice junction and exon identity among species can be valuable resources to elucidate species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its features such as graph query and multi-genome alignment query. ASAP II can be searched by several different criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence information, types of alternative splicing patterns, and inclusion rates for skipped exons are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. Tissue-specificity p-values are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.

  3. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
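
    The three creation steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: it uses only the Python standard library instead of Biopython, computes a single property (mean Kyte-Doolittle hydrophobicity), and the function names, the ID format, and the seed are hypothetical.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_sequence(rng, min_len=50, max_len=300):
    """Step 1: draw a random amino acid chain of 50 to 300 residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (one property only): average Kyte-Doolittle score."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, index):
    """Steps 1-3 combined into one row; the class label is drawn at random."""
    seq = random_sequence(rng)
    return {
        "ID_Protein": f"PROT_{index:05d}",  # hypothetical ID format
        "Sequence": seq,
        "Hydrophobicity": round(mean_hydrophobicity(seq), 3),
        "Sequence_Length": len(seq),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(rng, i) for i in range(5)]
for rec in records:
    print(rec["ID_Protein"], rec["Sequence_Length"], rec["Hydrophobicity"], rec["Class"])
```

    Because the class labels in such a scheme are independent of the sequences, a classifier trained on the result would not be expected to do better than chance; the data is mainly useful for exercising a pipeline end to end.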

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  4. uniprot-database_(type_ko).27.09.2019.tab.rar

    • figshare.com
    application/x-rar
    Updated Jun 24, 2020
    + more versions
    Cite
    Daniel Kumazawa Morais (2020). uniprot-database_(type_ko).27.09.2019.tab.rar [Dataset]. http://doi.org/10.6084/m9.figshare.12555422.v1
    Available download formats: application/x-rar
    Dataset updated
    Jun 24, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Kumazawa Morais
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The current database was downloaded on 27.09.2019 and has the data fields (columns) described below:

    1. Entry
    2. Entry name
    3. Status
    4. Protein names
    5. Gene names
    6. Organism
    7. Length
    8. Cross-reference (KO)
    9. Taxonomic lineage (PHYLUM)
    10. Taxonomic lineage (SPECIES) (this field carries current and old* taxonomic classifications)
    11. Taxonomic lineage (GENUS)
    12. Taxonomic lineage (KINGDOM)
    13. Taxonomic lineage (SUPERKINGDOM)
    14. Cross-reference (OrthoDB)
    15. Cross-reference (eggNOG)

    *Details about the classification used in UniProt can be found at: https://www.uniprot.org/help/taxonomy
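
    A tab-separated export with these fields could be read along the following lines. This is a minimal sketch using the Python standard library. It assumes the extracted file starts with a header row whose names match the fields described above and that multiple KO identifiers in one cell are separated by semicolons; the two sample rows are invented placeholders, not records from the archive.

```python
import csv
import io
from collections import Counter

# Invented two-row sample standing in for the extracted .tab file.
# Assumption: the first line is a header using the field names described above.
SAMPLE = (
    "Entry\tEntry name\tOrganism\tLength\tCross-reference (KO)\n"
    "X00001\tEXAMPLE1_SPECA\tSpecies A\t120\tK00001;\n"
    "X00002\tEXAMPLE2_SPECB\tSpecies B\t340\tK00001;K00002;\n"
)


def count_ko_terms(handle):
    """Count how many entries reference each KO identifier."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: multiple KO identifiers are separated by ';'.
        for ko in row["Cross-reference (KO)"].split(";"):
            if ko.strip():
                counts[ko.strip()] += 1
    return counts


ko_counts = count_ko_terms(io.StringIO(SAMPLE))
print(ko_counts.most_common())
```

    For the real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8") once the .rar archive has been extracted.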

  5. DAVID

    • neuinfo.org
    • dknet.org
    • +1 more
    Updated Aug 17, 2024
    Cite
    (2024). DAVID [Dataset]. http://identifiers.org/RRID:SCR_001881
    Dataset updated
    Aug 17, 2024
    Description

    Bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. Consists of a comprehensive knowledgebase and a set of functional analysis tools. Includes a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.

  6. Databases for MyCodentifier: A tool for routine identification of...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Dec 9, 2022
    Cite
    Jodie A. Schildkraut; Jordy P.M. Coolen; Heleen Severin; Ellen Koenraad; Nicole Aalders; Willem J.G. Melchers; Wouter Hoefsloot; Heiman F.L. Wertheim; Jakko van Ingen (2022). Databases for MyCodentifier: A tool for routine identification of nontuberculous mycobacteria using MGIT enriched shotgun metagenomics. [Dataset]. http://doi.org/10.5281/zenodo.7396289
    Available download formats: application/gzip
    Dataset updated
    Dec 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jodie A. Schildkraut; Jordy P.M. Coolen; Heleen Severin; Ellen Koenraad; Nicole Aalders; Willem J.G. Melchers; Wouter Hoefsloot; Heiman F.L. Wertheim; Jakko van Ingen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Databases used for MyCodentifier, a Nextflow pipeline to identify Mycobacterium tuberculosis complex (MTBC) and nontuberculous mycobacteria (NTM) species from next-generation sequencing (NGS) data.

    Short description:
    The pipeline is built with Nextflow as the workflow manager and runs in a Docker container. It identifies MTBC/NTM species from positive Mycobacterial Growth Indicator Tube (MGIT) cultures. To do so, it uses an hsp65 database for fast identification, coupled with a metagenomic method that uses Centrifuge for genome-level identification. For TB it can also identify subspecies. Results are presented in automated PDF and HTML reports.

    Databases
    (name - short description)
    • 20220726_ref.tar.gz - 7 major mycobacterial genomes as centrifuge classification database, used for reference-based mapping and genotype resistance prediction
    • 20220726_wgs_centrifuge_db_Radboudumc_MB.tar.gz - centrifuge classification database using Tortoli et al 2017 Mycobacterium strains + additional strains
    • genomes.tar.gz - 7 major mycobacterial genomes, annotation and Genbank files. Files are paired with 20220726_ref.tar.gz
    • snpEff.tar.gz - 7 major mycobacterial genomes annotation models for snpEff.
    • Tortoli_etal_hsp65.tar.gz - KMA database of hsp65 gene extractions of the Tortoli et al 2017 Mycobacterium strains.

    Used in the study:
    p_compressed+h+v.tar.gz (12/06/2016)

    Databases available via ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data or https://ccb.jhu.edu/software/centrifuge/manual.shtml#custom-database

    MyCodentifier Github:

    https://jordycoolen.github.io/MyCodentifier/

  7. GTDB r220 Mash Database (UNOFFICIAL MIRROR)

    • zenodo.org
    bin
    Updated Jun 5, 2024
    + more versions
    Cite
    Josh L. Espinoza (2024). GTDB r220 Mash Database (UNOFFICIAL MIRROR) [Dataset]. http://doi.org/10.5281/zenodo.11494307
    Available download formats: bin
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Josh L. Espinoza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an UNOFFICIAL host for the GTDB mash sketch based on GTDB r220

    Intended use of this file is to include in the VEBA database for quicker GTDB-Tk analysis.

    Created by running the following command using GTDB-Tk v2.4.0 on the S1 sample from Zenodo:7946802:

    gtdbtk classify_wf --genome_dir veba_output/binning/prokaryotic/S1/output/genomes/ --out_dir test_output -x fa --cpus 1 --mash_db ./gtdb_r220.msh

    Source Files:

    gtdbtk_r220_data.tar.gz

    RELEASE_NOTES.txt

    Release 220.0:
    --------------
    
    GTDB release R09-RS220 comprises 596,859 genomes organised into 113,104 species clusters. 
    Additional statistics for this release are available on the GTDB Statistics page.
    
    Release notes:
    --------------
    
     - Average nucleotide identity (ANI) between genomes is now calculated using skani (Shaw et al., Nat Methods, 2023) instead of FastANI (Jain et al, Nat Commun, 2018). 
      skani provides a substantial reduction in computational requirements while producing similar ANI values and more accurate alignment fraction (AF) values.
     - CheckM v2 information is included on the website and in the metadata files, noting at this stage that these data were not used for the QC step in release 220. 
     - Post-curation cycle, we identified updated spelling for 15 taxon names: 
      p_Calescibacterota (updated name: Calescibacteriota)
      c_Brachyspirae (updated name: Brachyspiria)
      c_Leptospirae (updated name: Leptospiria)
      o_Ammonifexales (updated name: Ammonificales)
      o_Exiguobacterales (updated name: Exiguobacteriales)
      o_Hydrogenedentiales (updated name: Hydrogenedentales)
      o_Phormidesmiales (updated name: Phormidesmidales)
      f_Arcanobacteraceae (updated name: Arcanibacteraceae)
      f_Acetonemaceae (updated name: Acetonemataceae)
      f_Ethanoligenenaceae (updated name: Ethanoligenentaceae)
      f_Exiguobacteraceae (updated name: Exiguobacteriaceae)
      f_Geitlerinemaceae (updated name: Geitlerinemataceae)
      f_Koribacteraceae (updated name: Korobacteraceae)
      f_Phormidesmiaceae (updated name: Phormidesmidaceae)
      f_Porisulfidaceae (updated name: Poriferisulfidaceae)
      Note that the LPSN linkouts point to the correct updated names. We encourage users to use the updated names as these will appear in the next release.
     - Post-curation cycle, we discovered that two provisionally named families, Nitrincolaceae and Denitrovibrionaceae have been validly named under the ICNP as Balneatricaceae and Geovibrionaceae, respectively. 
      We encourage users to use the validly published names as these will appear in the next release.
     - We thank Jan Mares for his assistance in curating the class Cyanobacteriia and Brian Kemish for providing IT support to the project.

    If you have found this useful, please cite the original publications:

  8. GTDB_r89_54k

    • bridges.monash.edu
    • researchdata.edu.au
    tar
    Updated Jul 23, 2019
    + more versions
    Cite
    Guillaume Meric; Ryan Wick (2019). GTDB_r89_54k [Dataset]. http://doi.org/10.26180/5d369804283f0
    Available download formats: tar
    Dataset updated
    Jul 23, 2019
    Dataset provided by
    Monash University
    Authors
    Guillaume Meric; Ryan Wick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of (compressed) index database files suitable for use with Centrifuge, Kraken1 and Kraken2 that can be used to classify metagenomes using the GTDB_r89_54k index. More information and details at: https://github.com/rrwick/Metagenomics-Index-Correction

  9. Data from: Mass Spectrometry-Based Proteomics Combined with Bioinformatic...

    • acs.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Jacek P. Dworzanski; Samir V. Deshpande; Rui Chen; Rabih E. Jabbour; A. Peter Snyder; Charles H. Wick; Liang Li (2023). Mass Spectrometry-Based Proteomics Combined with Bioinformatic Tools for Bacterial Classification [Dataset]. http://doi.org/10.1021/pr050294t.s003
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Jacek P. Dworzanski; Samir V. Deshpande; Rui Chen; Rabih E. Jabbour; A. Peter Snyder; Charles H. Wick; Liang Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Timely classification and identification of bacteria is of vital importance in many areas of public health. We present a mass spectrometry (MS)-based proteomics approach for bacterial classification. In this method, a bacterial proteome database is derived from all potential protein coding open reading frames (ORFs) found in 170 fully sequenced bacterial genomes. Amino acid sequences of tryptic peptides obtained by LC−ESI MS/MS analysis of the digest of bacterial cell extracts are assigned to individual bacterial proteomes in the database. Phylogenetic profiles of these peptides are used to create a matrix of sequence-to-bacterium assignments. These matrixes, viewed as specific assignment bitmaps, are analyzed using statistical tools to reveal the relatedness between a test bacterial sample and the microorganism database. It is shown that, if a sufficient amount of sequence information is obtained from the MS/MS experiments, a bacterial sample can be classified to a strain level by using this proteomics method, leading to its positive identification. Keywords: classification of bacteria • proteomics • tandem mass spectrometry • LC−MS/MS • bioinformatics

  10. GTDB r214.1 Mash Database (UNOFFICIAL MIRROR)

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    Updated Jun 17, 2023
    Cite
    Josh L. Espinoza (2023). GTDB r214.1 Mash Database (UNOFFICIAL MIRROR) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8048186
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    J. Craig Venter Institute
    Authors
    Josh L. Espinoza
    License

    https://www.gnu.org/licenses/agpl.txt

    Description

    This is an UNOFFICIAL host for the GTDB mash sketch based on GTDB r214.1

    Intended use of this file is to include in the VEBA database for quicker GTDB-Tk analysis.

    Created by running the following command using GTDB-Tk v2.3.0 on the S1 sample from Zenodo:7946802:

    gtdbtk classify_wf --genome_dir veba_output/binning/prokaryotic/S1/output/genomes/ --out_dir test_output -x fa --cpus 1 --mash_db ./gtdb_r214.msh

    Source Files:

    gtdbtk_r214_data.tar.gz

    RELEASE_NOTES.txt

    Release Notes:

    Release 214.1:

    Correction regarding the classification of the genome "GB_GCA_902406375.1" in 214.1 release. We have identified an error in the taxonomy assignment for this particular genome.

    The genome GB_GCA_902406375.1 was previously classified as Collinsella sp905215505 in some files. We have reevaluated the taxonomy and determined that the correct classification should be Collinsella sp002232035. We have rectified this error and made the necessary updates to the following files within the package:

    • bac120_taxonomy_r214.tsv
    • sp_clusters_r214.tsv
    • ssu_all_r214.tar.gz

    Notes:

    • We thank Jan Mareš for his help in curating the Cyanobacteria

    • Phylum names have been updated following the valid publication of 42 names in IJSEM (https://pubmed.ncbi.nlm.nih.gov/34694987/), including Bacillota and Pseudomonadota

    • Fixed issue with SSU files where sequences started 2 bp after correct start and stopped 1 bp after correct end of sequence. Thanks to CX for bringing this issue to our attention: https://forum.gtdb.ecogenomic.org/t/16s-23s-and-ssu-all-r207/307/2

    • SSU files now provide sequences in their 5' to 3' orientation

    • Changed QC criterion for number of contigs from 1000 to 2000 in order to better align the GTDB criteria with RefSeq (https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/)

    • Changed QC criterion to use ar53 instead of ar122 marker set. The impact of this change was evaluated on the 353,569 genomes (~6,100 archaeal) considered for GTDB R207:
      - only 1 additional genome passed QC
      - only 21 additional genomes failed QC, which included the following species representatives:
        - s_Methanoregula sp002497485
        - s_Methanobrevibacter_A sp017634055
        - s_Methanosphaera sp003266165
        - s_MGIIa-L1 sp002688825
        - s_MGIIb-N2 sp002503665
        - s_MGIIa-L2 sp002692685
        - s_MGIIb-O3 sp002730445
        - s_DTDI01 sp011334935
        - s_Methanosphaera sp017652595
        - s_Nitrosopelagicus sp902606945
        - s_Methanolinea sp002501965

    If you have found this useful, please cite the original publications:

    Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. Bioinformatics, btac672.

    Parks, D.H., et al. (2021). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50: D785–D794.

  11. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13908086
    Available download formats: application/gzip, zip, bz2
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 31, 2024
    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH SuperFamily or Fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).

    For all chains in the chain-level TED-redundant files, the file contains boundary predictions, consensus level and information on the TED100 representative.

    For both TED100 and TED-redundant we provide domain boundary predictions outputted by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as 6,433 highly symmetrical folds representatives.

    Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.
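
    Where the command-line tools are not available, the same two extraction steps can be approximated with the Python standard library. This is only a sketch: the helper names gunzip and untar_gz are hypothetical, and the paths passed to them are up to the caller.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip(src, dest):
    """Decompress a single '.gz' file, similar to `gunzip -c src > dest`."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large files are fine
    return Path(dest)


def untar_gz(src, dest_dir):
    """Unpack a '.tar.gz' archive, similar to `tar -xzf src -C dest_dir`."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(src, "r:gz") as archive:
        # Only extract archives obtained from a source you trust.
        archive.extractall(dest_dir)
    return Path(dest_dir)
```

    For example, gunzip("ted_100_324m_domain_id.list.gz", "ted_100_324m_domain_id.list") would write the decompressed identifier list next to the download, assuming the file has been fetched into the working directory.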

    CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.


    Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

    Changelog Version 5:

    • Add: ted_365m.domain_summary.cath.globularity.taxid.tsv.tar.gz - This table, in the same format as the previous ted_100_324m.domain_summary.cath.globularity.taxid.tsv.tar.gz, contains per-domain annotations for the whole of TED, including metadata on domain quality metrics such as secondary structure elements counts, globularity scores, average pLDDT and taxonomical assignments.
    • Add: high_symmetry_folds_set.domain_summary.tsv.gz - subset of ted_365m.domain_summary.cath.globularity.taxid.tsv containing information on 6,433 high symmetry folds in TED. The entries are sorted in descending order by Z-score obtained from SymD.
    • Add: high_symmetry_folds_set_models.tar.gz - TED domain models in PDB format for 6,433 high symmetry folds in TED.
    • Add: ISP_data.tar.gz - Raw data for Interacting SuperFamily Pairs calculations used in the manuscript. A more detailed description of the ISP data is available below as well as within the tar.gz file.
    • Add: ted_redundant_40m_domain_id.list.gz - list of TED_domain_ID in TED redundant
    • Add: ted_100_324m_domain_id.list.gz - list of TED_domain_ID in TED100
    • Fix/Replace: A domain-level summary of TED, now consolidated into ted_365m.domain_summary.cath.globularity.taxid.tsv, is consistent with the protocol used in the manuscript. As Foldclass and Foldseek T-level hits provide all 4 CATH digits, we removed the H portion of the CATH code from each prediction at the T-level.
      Previously, the following columns
      14. cath_label - CATH superfamily code if predicted, either a C.A.T.H. homologous superfamily or C.A.T. fold assignment, e.g. 3.40.50.300
      15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment.
      16. cath_assignment_method - Method used to assign a CATH label, either Foldseek or Foldclass
      sometimes showed an additional label with a T-level prediction by Foldclass in the case of T-level assignments obtained by Foldseek, e.g.
      3.40.30,3.40.30 T foldseek,foldclass
      This has now been corrected to reflect the TED protocol, with Foldclass T-level assignments applied only to domains where a T-level assignment could not be applied using Foldseek, e.g.
      domain-x 3.40.30 T foldseek
      domain-y 3.20.20 T foldclass

      Thus, in the current version of the data, a CATH assignment label can only be one of:
      H-level assignment by Foldseek (e.g. 3.40.50.300 H foldseek)
      T-level assignment by Foldseek (e.g. 3.40.30 T foldseek)
      T-level assignment by Foldclass (e.g. 3.40.30 T foldclass)
      or no assignment (- - -)
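
    The four permitted combinations can be checked mechanically when reading the summary table. The sketch below is an illustration under assumptions, not part of the TED release: the function and class names are hypothetical, and it takes the three column values (cath_label, cath_assignment_level, cath_assignment_method) as already-split strings.

```python
from typing import NamedTuple, Optional


class CathAssignment(NamedTuple):
    label: str    # e.g. "3.40.50.300" (H level) or "3.40.30" (T level)
    level: str    # "H" or "T"
    method: str   # "foldseek" or "foldclass"


# The only (level, method) pairs described in the changelog above.
ALLOWED = {("H", "foldseek"), ("T", "foldseek"), ("T", "foldclass")}
# Number of dot-separated fields expected in the label at each level.
EXPECTED_DEPTH = {"H": 4, "T": 3}


def parse_cath_assignment(label, level, method) -> Optional[CathAssignment]:
    """Return a validated assignment, or None for the unassigned case '- - -'."""
    if (label, level, method) == ("-", "-", "-"):
        return None
    if (level, method) not in ALLOWED:
        raise ValueError(f"unexpected level/method pair: {level} {method}")
    parts = label.split(".")
    if len(parts) != EXPECTED_DEPTH[level] or not all(p.isdigit() for p in parts):
        raise ValueError(f"label {label!r} does not match level {level}")
    return CathAssignment(label, level, method)


print(parse_cath_assignment("3.40.50.300", "H", "foldseek"))
print(parse_cath_assignment("3.40.30", "T", "foldclass"))
print(parse_cath_assignment("-", "-", "-"))
```

    A row in the older doubled form (3.40.30,3.40.30 T foldseek,foldclass) fails the pair check and raises ValueError, which makes it easy to spot files from before the fix.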


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m_domain_id.list.gz - list of ~324 million domain identifiers in TED100, one per line in the format AF-
    • ted_redundant_40m_domain_id.list.gz - list of ~40 million domain identifiers in TED redundant, one per line in the format AF-
    • ted_365m.domain_summary.cath.globularity.taxid.tsv, novel_folds_set.domain_summary.tsv and high_symmetry_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty
      Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

      1. ted_id - TED domain identifier in the format AF-
    • ted_324m_seq_clustering.cathlabels.tsv.gz
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass

    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz
    • The file ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz contains a header with the

  12. SUPFAM

    • neuinfo.org
    • scicrunch.org
    • +2 more
    Updated Nov 14, 2024
    + more versions
    Cite
    (2024). SUPFAM [Dataset]. http://identifiers.org/RRID:SCR_005304
    Dataset updated
    Nov 14, 2024
    Description

    SUPFAM is a database that consists of clusters of potentially related homologous protein domain families, with and without three-dimensional structural information, forming superfamilies. The present release (Release 3.0) of SUPFAM uses homologous families in Pfam (Version 23.0) and SCOP (Release 1.69), which are examples of sequence-alignment and structure classification databases, respectively. The two steps involved in setting up the SUPFAM database are:

    • Relating Pfam and SCOP families using a new profile-profile alignment algorithm, AlignHUSH. This identifies many Pfam families that could be related to a family or superfamily with known structural information.
    • An all-against-all match among Pfam families with as yet unknown structure, resulting in the identification of related Pfam families that form new potential superfamilies.

    The SUPFAM database can be used in either Browse mode or Search mode. In Browse mode you can browse through the Superfamilies, Pfam families or SCOP families. In each of these modes you will be presented with a full list which can be easily browsed. In Search mode, you can search for Pfam families, SCOP families or Superfamilies based on keywords or SCOP/Pfam identifiers of families and superfamilies.

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.

  13. DECIPHER (SILVA_r132) training set for classification

    • figshare.com
    xz
    Updated Jun 7, 2020
    Cite
    Christopher Trivedi (2020). DECIPHER (SILVA_r132) training set for classification [Dataset]. http://doi.org/10.6084/m9.figshare.12443522.v1
    Explore at:
    Available download formats: xz
    Dataset updated
    Jun 7, 2020
    Dataset provided by
    figshare
    Authors
    Christopher Trivedi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a link to the previous DECIPHER (http://www2.decipher.codes/Downloads.html) SILVA_r132 training set, since the version on their website has been updated to the SILVA_r138 training set. This is for use in an amplicon training workflow as part of the Bioinformatics Virtual Coordination Network (BVCN; https://biovcnet.github.io/). The tutorial in question can be found on the BVCN GitHub: https://github.com/biovcnet/topic-amplicons/tree/master/Lesson03b.

  14. Genome Sizes of Bacterial Species Detected in Cell-Free DNA of Patients with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 24, 2024
    Cite
    Mathur, Arpit; Anam, Karishma; Gawde, Vaibhav; Terse, Vishram; Bhanshe, Prasanna; Joshi, Swapnali; Chaudhary, Shruti; Chatterjee, Gaurav; Rajpal, Sweta; Tembhare, Prashant; Mirgh, Sumeet; Shetty, Alok; Punatar, Sachin; Nayak, Lingaraj; Jain, Hasmukh; Sengar, Manju; Bagal, Bhausaheb; Subramanian, PG; Gujral, Sumeet; Jindal, Nishant; Shetty, Dhanalaxmi; Khattry, Navin; Gokarn, Anant; Patkar, Nikhil (2024). Genome Sizes of Bacterial Species Detected in Cell-Free DNA of Patients with Acute Leukemia and Sepsis, Including Those Undergoing Bone Marrow Transplantation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13356510
    Explore at:
    Dataset updated
    Aug 24, 2024
    Dataset provided by
    Advanced Centre for Treatment, Research and Education in Cancer
    Tata Memorial Hospital
    Authors
    Mathur, Arpit; Anam, Karishma; Gawde, Vaibhav; Terse, Vishram; Bhanshe, Prasanna; Joshi, Swapnali; Chaudhary, Shruti; Chatterjee, Gaurav; Rajpal, Sweta; Tembhare, Prashant; Mirgh, Sumeet; Shetty, Alok; Punatar, Sachin; Nayak, Lingaraj; Jain, Hasmukh; Sengar, Manju; Bagal, Bhausaheb; Subramanian, PG; Gujral, Sumeet; Jindal, Nishant; Shetty, Dhanalaxmi; Khattry, Navin; Gokarn, Anant; Patkar, Nikhil
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Next Generation Sequencing (NGS) analysis of cell-free DNA provides valuable insight into the spectrum of pathogenic species (particularly bacterial) in blood. Patients with sepsis often face delays in their treatment regimens (a combination or cocktail of antibiotics) because of the long turnaround time (TAT) of classical, standard blood culture procedures. NGS gives results with a lower TAT along with high-depth coverage. NGS may therefore help clinicians decide on treatment regimens sooner and more accurately, possibly saving lives.

    Our curated dataset lists the bacterial species or strains detected, along with their genome sizes, in 107 AML patients clinically diagnosed with sepsis. Cell-free DNA profiles of the patients were built and sequenced on Illumina platforms (NovaSeq and NextSeq). Bioinformatic analysis was performed using two classification algorithms, kraken2 and kaiju. For kraken2-based classification, the reference bacterial index developed by Carlo Ferravante et al. (Zenodo 2020) (link: https://zenodo.org/records/4055180) was used, while for kaiju-based classification the reference database named "nr_euk" dated "2023-05-10" (link: https://bioinformatics-centre.github.io/kaiju/downloads.html) was used.

    Genome size annotation is important in metagenomics because genome size is required to compute depth of coverage (abundance). Metagenomic classification algorithms such as kraken/kraken2 and kaiju report only the number of reads assigned, not abundance. With kaiju the problem is more complicated, since the reference database does not include a FASTA file, only the index file against which alignment is done.

    To address these challenges and compute "depth of coverage", or simply abundance, we built a genome size annotator tool (https://github.com/patkarlab/Genome-Size-Annotation) that provides the genome size for each detected species, provided its taxid is available. The tool uses the NCBI Datasets tool, the NCBI Genome API check tool, and data mining from AI search engines such as perplexity.ai.
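
    As a rough illustration of why genome size is needed, depth of coverage is commonly approximated as (assigned reads × read length) / genome size. The sketch below is a minimal example under that assumption; the function name, the species names, the read counts, the genome sizes and the 150 bp read length are all hypothetical and are not taken from the dataset or from the annotator tool.

```python
def depth_of_coverage(assigned_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Approximate mean depth of coverage: sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return assigned_reads * read_length_bp / genome_size_bp


# Hypothetical example values (assumptions, not taken from the dataset).
READ_LENGTH_BP = 150
reads_per_species = {"species_a": 12_000, "species_b": 12_000}
genome_size_bp = {"species_a": 2_000_000, "species_b": 6_000_000}

# Same read count, different genome sizes, hence different depths.
depths = {
    name: depth_of_coverage(reads, READ_LENGTH_BP, genome_size_bp[name])
    for name, reads in reads_per_species.items()
}
```

    With equal read counts, the species with the smaller genome ends up with the higher depth, which is why read counts alone are not a measure of abundance.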

    We have curated two datasets:

    1) Kraken2 dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kraken_genome_annotation"

    2) Kaiju dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kaiju_genome_annotation"

    *Please note that for the kraken2 curated dataset we used data mining from the AI search engine perplexity.ai, while for kaiju we did not use perplexity.ai, and any species whose genome size was not found was labeled "NA".

  15. Data from: A novel protein motif finding algorithm for classification of the...

    • bridges.monash.edu
    • researchdata.edu.au
    pdf
    Updated Nov 21, 2017
    Cite
    Sun, Deng-Kuan; Zhang, Tong-Liang; Ding, Yong-Sheng (2017). A novel protein motif finding algorithm for classification of the ligase subfamilies [Dataset]. http://doi.org/10.4225/03/5a1371c69c0e3
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 21, 2017
    Dataset provided by
    Monash University
    Authors
    Sun, Deng-Kuan; Zhang, Tong-Liang; Ding, Yong-Sheng
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    Extracting motifs from a protein family or subfamily remains an active research topic in bioinformatics. It not only helps in understanding protein function and predicting the class to which an unknown protein sequence belongs, but also aids the study of protein-protein interactions. In this paper, we present a novel algorithm to extract the motifs of a subfamily, based on feature selection and position connection. Position connection is applied to generate motifs; the result is a hybrid method that uses a vote-based decision mechanism to construct the classifier of the ligase subfamilies. In tests on the database, a predictive accuracy of more than 95.87% is achieved. The result demonstrates that this novel method is practical. In addition, the method shows that motifs play an important role in classifying proteins and in studying the characteristics of the subfamilies or families in protein databases. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

    Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.

  16. Datasets for Lupo et al. (2022) An extended reservoir of class-D...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Feb 15, 2022
    Cite
    Valérian Lupo; Denis BAURAIN; Frédéric Kerff (2022). Datasets for Lupo et al. (2022) An extended reservoir of class-D beta-lactamases in non-clinical bacterial strains [Dataset]. http://doi.org/10.6084/m9.figshare.18544955.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 15, 2022
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Valérian Lupo; Denis BAURAIN; Frédéric Kerff
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Lupo et al. 2022: Archive content for v2

    Overview...

    17 directories, 182 files

    README.md: this file.
    command-line.sh: examples of bash commands to use or generate the files stored in this archive.
    biosample: This directory contains input and output files used to assign a “clinical” score to a BioSample report (…)
    bldb_oxa: File in .fasta format of the reference OXA-family sequences from the Beta-lactamase Database (BLDB) used for annotation with the annotate.pl perl script from Bio::MUST modules.
    genetic_environment: This directory contains the list of bacterial assembly download links in .csv format to provide to GeneSpy and the list of contig accession numbers to download with the command-line efetch tool from the NCBI E-utilities.
    local_refseq_db: The list of the assembly accession numbers of the local RefSeq database built on 7th of December 2017.
    ncbi_pathogen: This directory contains consolidated FASTA (.fasta) and TSV (.tab) files downloaded from the NCBI Pathogen Detection server (ftp://ftp.ncbi.nlm.nih.gov/pathogen/): all-prot-nr.fasta, all_bla.tab. It also contains files associated to class-D beta-lactamases (…)
    oxa_family: This directory contains the FASTA file bla_d.fasta with the 24,916 OXA-family protein selected with the ompa-pa.pl script and its deduplicated file clst95_bla_d.fasta and also the coordinates file class_d98.bb and the sequence accession identifier file class_d98.idl from ompa-pa.pl.
    alignments: Three alignments of OXA-family proteins are available (…)
    tree: The mapper.idm is a TSV file that contains the short and corresponding long sequence identifiers used to rename sequences for booster and RAxML tree.
    booster: This directory contains raw output files obtained from the booster web server in NEWICK format. boosterweb_tbe_norm.nh is the final tree file.
    consense: Consensus tree computed with consense (PHYLIP package) using the 100 replicate trees of RAxML: RAxML_bootstrap.classd-final-edit_188-RAXML-PROTGAMMALGF-100xRAPIDBP.
    raxml: This directory contains raw output files of RAxML in NEWICK format, computed from the reduced alignment classd-final-edit_188.fasta.
    oxa_family_clusters: This directory contains alignment files in FASTA format and the corresponding .hmm profile files for non-singleton clusters (representative sequences) (…)
    oxa_family_domains: The 3510 unique OXA-family sequences and their corresponding taxonomy are available in FASTA format 3510_bla.fasta and TSV format 3510_bla.tax (…)
    phylogenetic_clustering: This directory contains a templatized R script mcl.script.R.tt used to compute phylogenetic clustering, the ladderized rooted OXA-family tree used by the R script and its associated traits file.
    scripts: This directory contains various perl scripts (…)
    sql_db: This directory contains the SQL files for the results database (…)
    taxdump-20180208: Mirror of the NCBI Taxonomy used in this study (downloaded on 8th of February 2018).

  17. InpactorDB: A Plant classified lineage-level LTR retrotransposon reference...

    • zenodo.org
    zip
    Updated Mar 23, 2022
    Cite
    Orozco-Arias Simon; Jaimes Paula A.; Candamil Mariana; Jiménez-Varón Cristian Felipe; Tabares-Soto Reinel; Isaza Gustavo; Guyot Romain (2022). InpactorDB: A Plant classified lineage-level LTR retrotransposon reference library for free-alignment methods based on Machine Learning [Dataset]. http://doi.org/10.5281/zenodo.4386317
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 23, 2022
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Orozco-Arias Simon; Jaimes Paula A.; Candamil Mariana; Jiménez-Varón Cristian Felipe; Tabares-Soto Reinel; Isaza Gustavo; Guyot Romain
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LTR retrotransposons are mobile elements that make up the major part of most plant genomes. Their identification and annotation via bioinformatics approaches represent a major challenge in the era of massive plant genome sequencing. In addition to their involvement in genome size variation, these elements are also associated with the function and structure of different chromosomal regions and with the alteration of the function of coding regions, among others. Several plant LTR retrotransposon sequence databases are available, either with public access, such as PGSB and RepetDB, or with restricted access, such as Repbase. Although they are useful for identifying LTR-RTs in new genomes by similarity, the elements of these databases are not classified in depth, down to the lineage/family level.

    Here, we present InpactorDB, a semi-curated dataset composed of 130,511 elements from 195 plant genomes (belonging to 108 plant species), classified down to the lineage level. This dataset has been used to train two deep neural networks (one fully connected and one convolutional) for fast classification of elements. In lineage-level classification, we obtain scores above 98% for F1-score, precision and recall.

    To classify elements of the ‘LTR_STRUC’ and ‘EDTA’ datasets, we used the methodology proposed by Inpactor, which uses a homology-based strategy with known coding domains belonging to LTR-RTs. We used the RexDB domain library as a reference. LTR-RTs were classified into superfamilies, Gypsy (RLG) or Copia (RLC), and sub-classified into lineages according to their similarity to five different amino acid reference domains (GAG, AP, RT, RNAseH, and INT domains). In addition, we applied filters to keep only intact elements:

    1) to remove predicted elements with domains from two different superfamilies (i.e. Gypsy and Copia),

    2) or elements with domains belonging to two or more different lineages,

    3) to remove elements with lengths different than those reported by Gypsy Database (GyDB) with a tolerance of 20%,

    4) to delete incomplete elements which have fewer than three identified domains, and

    5) to remove elements with insertions of TE class II (reported in Repbase).
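
    The five filters above can be sketched as a single predicate. This is a minimal illustration, not the Inpactor implementation: the `Element` record and all of its field names are hypothetical, and the 20% tolerance is read here as a relative difference from the reference length reported by GyDB, which is an assumption about how the filter is applied.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical per-element annotation; field names are assumptions.
    superfamilies: frozenset      # superfamilies of the detected domains, e.g. {"RLG"}
    lineages: frozenset           # lineages of the detected domains
    length_bp: int                # length of the predicted element
    reference_length_bp: int      # length reported by GyDB for the lineage
    n_domains: int                # number of identified domains
    has_class2_insertion: bool    # True if a class II TE insertion was found


def is_intact(e: Element, tolerance: float = 0.20) -> bool:
    """Return True if the element passes all five filters."""
    if len(e.superfamilies) > 1:      # 1) domains from two superfamilies
        return False
    if len(e.lineages) > 1:           # 2) domains from two or more lineages
        return False
    if abs(e.length_bp - e.reference_length_bp) > tolerance * e.reference_length_bp:
        return False                  # 3) length outside the tolerance
    if e.n_domains < 3:               # 4) fewer than three identified domains
        return False
    if e.has_class2_insertion:        # 5) class II TE insertion
        return False
    return True


# Hypothetical element that passes every filter.
example = Element(frozenset({"RLG"}), frozenset({"Tekay"}), 5500, 5000, 4, False)
kept = is_intact(example)
```

    Expressing each filter as an early return keeps the numbering in the code aligned with the numbered list, so a change to one criterion touches one line.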

    The final non-redundant version of InpactorDB consists of 67,305 LTR retrotransposons. Both redundant and non-redundant versions of InpactorDB are available in FASTA format, in which sequences have identifiers with the following general identification code:

    >Superfamily-Lineage-plant_family-specie-source-length-ID,

    where Superfamily is either RLC (for Copia) or RLG (for Gypsy), Lineage/family follows the RexDB nomenclature, source is one of the Repbase, RepetDB, PGSB, LTR_STRUC or EDTA datasets, and these are followed by the length and the ID, a unique number that identifies each element within InpactorDB.
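
    A header in this format can be split into its seven fields. The sketch below is a minimal example that assumes no individual field contains a hyphen; the `InpactorHeader` type, the `parse_header` function and the example header are hypothetical and are not part of InpactorDB.

```python
from typing import NamedTuple


class InpactorHeader(NamedTuple):
    superfamily: str   # "RLC" (Copia) or "RLG" (Gypsy)
    lineage: str
    plant_family: str
    species: str
    source: str
    length: int
    element_id: str


def parse_header(line: str) -> InpactorHeader:
    """Split '>Superfamily-Lineage-plant_family-specie-source-length-ID' into fields."""
    fields = line.strip().lstrip(">").split("-")
    if len(fields) != 7:
        # Assumption: fields never contain '-'; anything else is rejected.
        raise ValueError(f"expected 7 hyphen-separated fields, got {len(fields)}")
    superfamily, lineage, plant_family, species, source, length, element_id = fields
    if superfamily not in ("RLC", "RLG"):
        raise ValueError(f"unknown superfamily: {superfamily!r}")
    return InpactorHeader(superfamily, lineage, plant_family, species,
                          source, int(length), element_id)


# Hypothetical header, written to match the general identification code.
header = parse_header(">RLC-ALE-Rubiaceae-Coffea_canephora-EDTA-5123-42")
```

    Rejecting headers that do not split into exactly seven fields is a deliberate choice: if a lineage or species name did contain a hyphen, failing loudly is safer than silently shifting the fields.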

  18. The hybrid database of VITAP and related taxonomic assignments of IMG/VR...

    • figshare.com
    zip
    Updated Oct 18, 2024
    Cite
    Kaiyang Zheng (2024). The hybrid database of VITAP and related taxonomic assignments of IMG/VR (v.4) vOTUs [Dataset]. http://doi.org/10.6084/m9.figshare.25426159.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Figshare http://figshare.com/
    figshare
    Authors
    Kaiyang Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The hybrid database of VITAP and related taxonomic assignments of IMG/VR (v.4) vOTUs (https://github.com/DrKaiyangZheng/VITAP).

  19. Removing contaminants from databases of draft genomes

    • plos.figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Jennifer Lu; Steven L. Salzberg (2023). Removing contaminants from databases of draft genomes [Dataset]. http://doi.org/10.1371/journal.pcbi.1006277
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Jennifer Lu; Steven L. Salzberg
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.

  20. Human Disease Ontology 2018 update: classification, content and workflow...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 29, 2023
    Cite
    Quentin St.Charles (2023). Human Disease Ontology 2018 update: classification, content and workflow expansion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8083644
    Explore at:
    Dataset updated
    Jun 29, 2023
    Authors
    Quentin St.Charles
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    ABSTRACT:

    The Human Disease Ontology (DO) (http://www.disease-ontology.org) database has undergone significant expansion in the past three years. The DO disease classification includes specific formal semantic rules to express meaningful disease models and has expanded from a single asserted classification to include multiple-inferred mechanistic disease classifications, thus providing novel perspectives on related diseases. Expansion of disease terms, alternative anatomy, cell type and genetic disease classifications and workflow automation highlight the updates for the DO since 2015. The enhanced breadth and depth of the DO's knowledgebase has expanded the DO's utility for exploring the multi-etiology of human disease, thus improving the capture and communication of health-related data across biomedical databases, bioinformatics tools, genomic and cancer resources and demonstrated by a 6.6× growth in DO's user community since 2015. The DO's continual integration of human disease knowledge, evidenced by the more than 200 SVN/GitHub releases/revisions, since previously reported in our DO 2015 NAR paper, includes the addition of 2650 new disease terms, a 30% increase of textual definitions, and an expanding suite of disease classification hierarchies constructed through defined logical axioms.

    Instructions:

    Data was cleaned. Duplicates and unnecessary columns were removed. Column titles were changed.

    Inspiration:

    This dataset was uploaded to U-BRITE for the "DRG_DEPOT" summer 2023 team project.

    Acknowledgements:

    Schriml, L. M., Mitraka, E., Munro, J., Tauber, B., Schor, M., Nickle, L., Felix, V., Jeng, L., Bearer, C., Lichenstein, R., Bisordi, K., Campion, N., Hyman, B., Kurland, D., Oates, C. P., Kibbey, S., Sreekumar, P., Le, C., Giglio, M., & Greene, C.

    Human Disease Ontology 2018 update: classification, content and workflow expansion

    Nucleic Acids Research 2019; 47(D1), D955–D962; PMID: 30407550; DOI: https://doi.org/10.1093/nar/gky1032

    U-BRITE last update date: 06/28/2023
