60 datasets found

Protein Structural Domain Classification
cathdb.info
ec.i4cologne.com
+3more
Updated Sep 30, 2024
Cite
(2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
Explore at:
Unique identifier
https://identifiers.org/MIR:00100005
Dataset updated
Sep 30, 2024
Description
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
Alternative Splicing Annotation Project II Database
dknet.org
scicrunch.org
+2more
Updated Jan 29, 2022
Cite
(2022). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_000322
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis of genes and their splicing variants. For human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and the lists of tissue- and cancer-specific (versus normal) alternatively spliced genes were recalculated and updated. The authors have created novel orthologous exon and intron databases, with their splice variants, based on multiple alignment among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several criteria such as gene symbol, gene name, and ID (UniGene, GenBank, etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships with supporting evidence, types of alternative splicing patterns, and inclusion rates for skipped exons are listed in separate tables.
Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
+ more versions
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Explore at:
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
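
The three generation steps above can be sketched in plain Python. This is only an illustration, not the author's code: the description says Biopython was used for the properties, whereas this sketch computes just the mean Kyte-Doolittle hydrophobicity by hand. The `P00000`-style identifiers, the function names, and the fixed random seed are assumptions made for the example.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_sequence(rng, min_len=50, max_len=300):
    """Draw a uniformly random amino acid chain of 50 to 300 residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle value over all residues of the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, protein_id):
    """Build one synthetic row; the class label is assigned at random."""
    sequence = random_sequence(rng)
    return {
        "ID_Protein": protein_id,  # hypothetical identifier format
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(rng, f"P{i:05d}") for i in range(5)]
```

Because both the sequences and the class labels are drawn at random here, no feature in such a table can predict the class better than chance, which matches the limitation the description states about simulated classes.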

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
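
The 16,000 / 4,000 division corresponds to an 80/20 split of the 20,000 rows. A minimal sketch of such a split follows; how the published files were actually split (shuffling, seed, stratification) is not stated in the description, so the shuffle and the seed below are assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and cut off the requested fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 20,000 protein records.
train_rows, test_rows = split_train_test(range(20000))
```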

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
uniprot-database_(type_ko).27.09.2019.tab.rar
figshare.com
application/x-rar
Updated Jun 24, 2020
+ more versions
Cite
Daniel Kumazawa Morais (2020). uniprot-database_(type_ko).27.09.2019.tab.rar [Dataset]. http://doi.org/10.6084/m9.figshare.12555422.v1
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.12555422.v1
Dataset updated
Jun 24, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Daniel Kumazawa Morais
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The current database was downloaded on 27.09.2019 and has the data fields (columns) described below:
1. Entry
2. Entry name
3. Status
4. Protein names
5. Gene names
6. Organism
7. Length
8. Cross-reference (KO)
9. Taxonomic lineage (PHYLUM)
10. Taxonomic lineage (SPECIES). This field carries current and old* taxonomic classifications.
11. Taxonomic lineage (GENUS)
12. Taxonomic lineage (KINGDOM)
13. Taxonomic lineage (SUPERKINGDOM)
14. Cross-reference (OrthoDB)
15. Cross-reference (eggNOG)
*Details about the classification used in UNIPROT can be found at: https://www.uniprot.org/help/taxonomy
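
A table with these 15 fields can be read with Python's standard `csv` module once the archive is unpacked. The sketch below rests on several assumptions not confirmed by the description: that the file is tab-separated with a header row carrying exactly these column names, and that the KO cross-reference field holds semicolon-separated identifiers. The sample row is invented placeholder data.

```python
import csv
import io

COLUMNS = [
    "Entry", "Entry name", "Status", "Protein names", "Gene names",
    "Organism", "Length", "Cross-reference (KO)",
    "Taxonomic lineage (PHYLUM)", "Taxonomic lineage (SPECIES)",
    "Taxonomic lineage (GENUS)", "Taxonomic lineage (KINGDOM)",
    "Taxonomic lineage (SUPERKINGDOM)", "Cross-reference (OrthoDB)",
    "Cross-reference (eggNOG)",
]

# Invented placeholder row, used only to exercise the parser.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join([
    "ENTRY1", "NAME1_EXAMPLE", "reviewed", "Example protein", "exA",
    "Example organism", "120", "K00001;", "PhylumX", "SpeciesX",
    "GenusX", "KingdomX", "SuperkingdomX", "", "",
]) + "\n"


def read_rows(handle):
    """Yield one dict per row, with Length as int and KO ids as a list."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["Length"] = int(row["Length"])
        ko_field = row["Cross-reference (KO)"]
        row["KO"] = [k.strip() for k in ko_field.split(";") if k.strip()]
        yield row


rows = list(read_rows(io.StringIO(SAMPLE)))
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by an open text handle on the extracted `.tab` file.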
DAVID
neuinfo.org
dknet.org
+1more
Updated Aug 17, 2024
Cite
(2024). DAVID [Dataset]. http://identifiers.org/RRID:SCR_001881
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_001881
Dataset updated
Aug 17, 2024
Description
Bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. Consists of a comprehensive knowledgebase and a set of functional analysis tools. Includes a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.

Databases for MyCodentifier: A tool for routine identification of...

zenodo.org
data.niaid.nih.gov

application/gzip

Updated Dec 9, 2022

Cite

Jodie A. Schildkraut; Jordy P.M. Coolen; Heleen Severin; Ellen Koenraad; Nicole Aalders; Willem J.G. Melchers; Wouter Hoefsloot; Heiman F.L. Wertheim; Jakko van Ingen (2022). Databases for MyCodentifier: A tool for routine identification of nontuberculous mycobacteria using MGIT enriched shotgun metagenomics. [Dataset]. http://doi.org/10.5281/zenodo.7396289

Explore at:

Available download formats: application/gzip

Unique identifier

https://doi.org/10.5281/zenodo.7396289

Dataset updated

Dec 9, 2022

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

Databases used for MyCodentifier a Nextflow pipeline to identify Mycobacterium tuberculosis complex (MTBC) and Nontuberculous mycobacteria (NTM) species from Next-generation sequencing (NGS) data.

Short description:
The pipeline is constructed using nextflow as workflow manager running in a docker container. It is able to identify species of MTBC/NTM from positive Mycobacterial Growth Indicator Tube (MGIT) cultures. To do so it uses an hsp65 database for fast identification coupled with a Metagenomic method using centrifuge to identify on genome level. For TB it also is able to identify subspecies. Results are presented in automated pdf and html reports.

**Databases**
Name	Short Description
20220726_ref.tar.gz	7 major mycobacterial genomes as centrifuge classification database, used for reference-based mapping and genotype resistance prediction
20220726_wgs_centrifuge_db_Radboudumc_MB.tar.gz	centrifuge classification database using Tortoli et al 2017 Mycobacterium strains + additional strains
genomes.tar.gz	7 major mycobacterial genomes, annotation and Genbank files. Files are paired with 20220726_ref.tar.gz
snpEff.tar.gz	7 major mycobacterial genomes annotation models for snpEff.
Tortoli_etal_hsp65.tar.gz	KMA database of hsp65 gene extractions of the Tortoli et al 2017 Mycobacterium strains.
Used in the study: p_compressed+h+v.tar.gz (12/06/2016)	Databases available via ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data or https://ccb.jhu.edu/software/centrifuge/manual.shtml#custom-database

MyCodentifier Github:

https://jordycoolen.github.io/MyCodentifier/

GTDB r220 Mash Database (UNOFFICIAL MIRROR)

zenodo.org

bin

Updated Jun 5, 2024

+ more versions

Cite

Josh L. Espinoza (2024). GTDB r220 Mash Database (UNOFFICIAL MIRROR) [Dataset]. http://doi.org/10.5281/zenodo.11494307

Explore at:

Available download formats: bin

Unique identifier

https://doi.org/10.5281/zenodo.11494307

Dataset updated

Jun 5, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Josh L. Espinoza

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

This is an UNOFFICIAL host for the GTDB mash sketch based on GTDB r220

Intended use of this file is to include in the VEBA database for quicker GTDB-Tk analysis.

Created by running the following command using GTDB-Tk v2.4.0 on the S1 sample from Zenodo:7946802:

gtdbtk classify_wf --genome_dir veba_output/binning/prokaryotic/S1/output/genomes/ --out_dir test_output -x fa --cpus 1 --mash_db ./gtdb_r220.msh

Source Files:

gtdbtk_r220_data.tar.gz

RELEASE_NOTES.txt

Release 220.0:
--------------

GTDB release R09-RS220 comprises 596,859 genomes organised into 113,104 species clusters. 
Additional statistics for this release are available on the GTDB Statistics page.

Release notes:
--------------

 - Average nucleotide identity (ANI) between genomes is now calculated using skani (Shaw et al., Nat Methods, 2023) instead of FastANI (Jain et al, Nat Commun, 2018). 
  skani provides a substantial reduction in computational requirements while producing similar ANI values and more accurate alignment fraction (AF) values.
 - CheckM v2 information is included on the website and in the metadata files, noting at this stage that these data were not used for the QC step in release 220. 
 - Post-curation cycle, we identified updated spelling for 15 taxon names: 
  p_Calescibacterota (updated name: Calescibacteriota)
  c_Brachyspirae (updated name: Brachyspiria)
  c_Leptospirae (updated name: Leptospiria)
  o_Ammonifexales (updated name: Ammonificales)
  o_Exiguobacterales (updated name: Exiguobacteriales)
  o_Hydrogenedentiales (updated name: Hydrogenedentales)
  o_Phormidesmiales (updated name: Phormidesmidales)
  f_Arcanobacteraceae (updated name: Arcanibacteraceae)
  f_Acetonemaceae (updated name: Acetonemataceae)
  f_Ethanoligenenaceae (updated name: Ethanoligenentaceae)
  f_Exiguobacteraceae (updated name: Exiguobacteriaceae)
  f_Geitlerinemaceae (updated name: Geitlerinemataceae)
  f_Koribacteraceae (updated name: Korobacteraceae)
  f_Phormidesmiaceae (updated name: Phormidesmidaceae)
  f_Porisulfidaceae (updated name: Poriferisulfidaceae)
  Note that the LPSN linkouts point to the correct updated names. We encourage users to use the updated names as these will appear in the next release.
 - Post-curation cycle, we discovered that two provisionally named families, Nitrincolaceae and Denitrovibrionaceae have been validly named under the ICNP as Balneatricaceae and Geovibrionaceae, respectively. 
  We encourage users to use the validly published names as these will appear in the next release.
 - We thank Jan Mares for his assistance in curating the class Cyanobacteriia and Brian Kemish for providing IT support to the project.

If you have found this useful, please cite the original publications:

Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. Bioinformatics, btac672.
Parks, D.H., et al. (2021). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50: D785–D794.

GTDB_r89_54k
bridges.monash.edu
researchdata.edu.au
tar
Updated Jul 23, 2019
+ more versions
Cite
Guillaume Meric; Ryan Wick (2019). GTDB_r89_54k [Dataset]. http://doi.org/10.26180/5d369804283f0
Explore at:
Available download formats: tar
Unique identifier
https://doi.org/10.26180/5d369804283f0
Dataset updated
Jul 23, 2019
Dataset provided by
Monash University
Authors
Guillaume Meric; Ryan Wick
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
A collection of (compressed) index database files suitable for use with Centrifuge, Kraken1 and Kraken2 that can be used to classify metagenomes using the GTDB_r89_54k index. More information and details at: https://github.com/rrwick/Metagenomics-Index-Correction
Data from: Mass Spectrometry-Based Proteomics Combined with Bioinformatic...
acs.figshare.com
xls
Updated Jun 1, 2023
Cite
Jacek P. Dworzanski; Samir V. Deshpande; Rui Chen; Rabih E. Jabbour; A. Peter Snyder; Charles H. Wick; Liang Li (2023). Mass Spectrometry-Based Proteomics Combined with Bioinformatic Tools for Bacterial Classification [Dataset]. http://doi.org/10.1021/pr050294t.s003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1021/pr050294t.s003
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Jacek P. Dworzanski; Samir V. Deshpande; Rui Chen; Rabih E. Jabbour; A. Peter Snyder; Charles H. Wick; Liang Li
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
Timely classification and identification of bacteria is of vital importance in many areas of public health. We present a mass spectrometry (MS)-based proteomics approach for bacterial classification. In this method, a bacterial proteome database is derived from all potential protein coding open reading frames (ORFs) found in 170 fully sequenced bacterial genomes. Amino acid sequences of tryptic peptides obtained by LC−ESI MS/MS analysis of the digest of bacterial cell extracts are assigned to individual bacterial proteomes in the database. Phylogenetic profiles of these peptides are used to create a matrix of sequence-to-bacterium assignments. These matrixes, viewed as specific assignment bitmaps, are analyzed using statistical tools to reveal the relatedness between a test bacterial sample and the microorganism database. It is shown that, if a sufficient amount of sequence information is obtained from the MS/MS experiments, a bacterial sample can be classified to a strain level by using this proteomics method, leading to its positive identification. Keywords: classification of bacteria • proteomics • tandem mass spectrometry • LC−MS/MS • bioinformatics
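
The matrix of sequence-to-bacterium assignments described above can be pictured with a toy example. This is a deliberately simplified sketch, not the authors' pipeline: peptides are assigned by exact substring match, and the per-proteome column sum stands in for the statistical analysis the abstract mentions. The strain names, protein strings, and peptides are all invented.

```python
def assignment_matrix(peptides, proteomes):
    """Build a binary matrix: rows are peptides, columns are proteomes.

    A cell is 1 when the peptide occurs in at least one protein of that
    proteome (exact substring match, a simplifying assumption).
    """
    names = sorted(proteomes)
    matrix = [
        [int(any(pep in protein for protein in proteomes[name]))
         for name in names]
        for pep in peptides
    ]
    return names, matrix


def column_scores(names, matrix):
    """Count how many observed peptides each proteome can explain."""
    return {name: sum(row[j] for row in matrix)
            for j, name in enumerate(names)}


# Invented toy data: two "proteomes" and four observed peptides.
proteomes = {
    "strain_A": ["MKTAYIAKQR", "GGLVPRGSHM"],
    "strain_B": ["MKTAYIAKQR", "AAAWWW"],
}
peptides = ["TAYIAK", "LVPR", "GSHM", "WWW"]

names, matrix = assignment_matrix(peptides, proteomes)
scores = column_scores(names, matrix)
```

In this toy case the shared peptide `TAYIAK` cannot separate the two strains, while the strain-specific peptides do, which is the intuition behind needing enough sequence information to reach strain-level classification.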
GTDB r214.1 Mash Database (UNOFFICIAL MIRROR)
data.niaid.nih.gov
data-staging.niaid.nih.gov
+1more
Updated Jun 17, 2023
Cite
Josh L. Espinoza (2023). GTDB r214.1 Mash Database (UNOFFICIAL MIRROR) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8048186
Explore at:
Dataset updated
Jun 17, 2023
Dataset provided by
J. Craig Venter Institute
Authors
Josh L. Espinoza
License
https://www.gnu.org/licenses/agpl.txt
Description
This is an UNOFFICIAL host for the GTDB mash sketch based on GTDB r214.1

Intended use of this file is to include in the VEBA database for quicker GTDB-Tk analysis.

Created by running the following command using GTDB-Tk v2.3.0 on the S1 sample from Zenodo:7946802:

gtdbtk classify_wf --genome_dir veba_output/binning/prokaryotic/S1/output/genomes/ --out_dir test_output -x fa --cpus 1 --mash_db ./gtdb_r214.msh

Source Files:

gtdbtk_r214_data.tar.gz

RELEASE_NOTES.txt

Release Notes:

Release 214.1:

Correction regarding the classification of the genome "GB_GCA_902406375.1" in 214.1 release. We have identified an error in the taxonomy assignment for this particular genome.

The genome GB_GCA_902406375.1 was previously classified as Collinsella sp905215505 in some files. We have reevaluated the taxonomy and determined that the correct classification should be Collinsella sp002232035. We have rectified this error and made the necessary updates to the following files within the package:
- bac120_taxonomy_r214.tsv
- sp_clusters_r214.tsv
- ssu_all_r214.tar.gz

Notes:

We thank Jan Mareš for his help in curating the Cyanobacteria

Phylum names have been updated following the valid publication of 42 names in IJSEM (https://pubmed.ncbi.nlm.nih.gov/34694987/), including Bacillota and Pseudomonadota

Fixed issue with SSU files where sequences started 2 bp after correct start and stopped 1 bp after correct end of sequence. Thanks to CX for bringing this issue to our attention: https://forum.gtdb.ecogenomic.org/t/16s-23s-and-ssu-all-r207/307/2

SSU files now provide sequences in their 5' to 3' orientation

Changed QC criterion for number of contigs from 1000 to 2000 in order to better align the GTDB criteria with RefSeq (https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/)

Changed QC criterion to use ar53 instead of ar122 marker set. The impact of this change was evaluated on the 353,569 genomes (~6,100 archaeal) considered for GTDB R207:
- only 1 additional genome passed QC
- only 21 additional genomes failed QC, which included the following species representatives:
  - s_Methanoregula sp002497485
  - s_Methanobrevibacter_A sp017634055
  - s_Methanosphaera sp003266165
  - s_MGIIa-L1 sp002688825
  - s_MGIIb-N2 sp002503665
  - s_MGIIa-L2 sp002692685
  - s_MGIIb-O3 sp002730445
  - s_DTDI01 sp011334935
  - s_Methanosphaera sp017652595
  - s_Nitrosopelagicus sp902606945
  - s_Methanolinea sp002501965

If you have found this useful, please cite the original publications:

Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. Bioinformatics, btac672.

Parks, D.H., et al. (2021). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50: D785–D794.
The Encyclopedia of Domains (TED) structural domains assignments for...
zenodo.org
data.niaid.nih.gov
application/gzip, bz2 +1
Updated Oct 31, 2024
+ more versions
Cite
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13908086
Explore at:
Available download formats: application/gzip, zip, bz2
Unique identifier
https://doi.org/10.5281/zenodo.13908086
Dataset updated
Oct 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
Oct 31, 2024
Description
Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.

In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH SuperFamily or Fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).

For all chains in the chain-level TED-redundant files, the file contains boundary predictions, consensus level and information on the TED100 representative.

For both TED100 and TED-redundant we provide domain boundary predictions outputted by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as 6,433 highly symmetrical folds representatives.

Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.
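
For scripted workflows, the same two extraction steps can be done from Python's standard library. This is a minimal sketch with a hypothetical `extract` helper. Unlike `gunzip`, it leaves the compressed file in place, and for archives from untrusted sources the member paths should be checked before calling `extractall`.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract(path, out_dir="."):
    """Unpack a .tar.gz archive or decompress a single .gz file.

    Returns the output directory for archives and the decompressed
    file path for plain .gz files.
    """
    path = Path(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if path.name.endswith(".tar.gz"):
        # Equivalent of: tar -xzvf file.tar.gz
        with tarfile.open(path, "r:gz") as archive:
            archive.extractall(out_dir)
        return out_dir
    if path.suffix == ".gz":
        # Equivalent of gunzip, except that the .gz file is kept.
        target = out_dir / path.stem
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return target
    raise ValueError(f"unsupported file type: {path.name}")


# Example call (the file must already be downloaded):
# extract("high_symmetry_folds_set.domain_summary.tsv.gz", "ted_data")
```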

CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.

Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

Changelog Version 5:

Add: ted_365m.domain_summary.cath.globularity.taxid.tsv.tar.gz - This table, in the same format as the previous ted_100_324m.domain_summary.cath.globularity.taxid.tsv.tar.gz, contains per-domain annotations for the whole of TED, including metadata on domain quality metrics such as secondary structure elements counts, globularity scores, average pLDDT and taxonomical assignments.

Add: high_symmetry_folds_set.domain_summary.tsv.gz - subset of ted_365m.domain_summary.cath.globularity.taxid.tsv containing information on 6,433 high symmetry folds in TED. The entries are sorted in descending order by Z-score obtained from SymD.

Add: high_symmetry_folds_set_models.tar.gz - TED domain models in PDB format for 6,433 high symmetry folds in TED.

Add: ISP_data.tar.gz - Raw data for Interacting SuperFamily Pairs calculations used in the manuscript. A more detailed description of the ISP data is available below as well as within the tar.gz file.

Add: ted_redundant_40m_domain_id.list.gz - list of TED_domain_ID in TED redundant

Add: ted_100_324m_domain_id.list.gz - list of TED_domain_ID in TED100

Fix/Replace: A domain-level summary of TED, now consolidated into ted_365m.domain_summary.cath.globularity.taxid.tsv, is consistent with the protocol used in the manuscript. As Foldclass and Foldseek T-level hits provide all 4 CATH digits, we removed the H portion of the CATH code from each prediction at the T-level.
Previously, the following columns
14. cath_label - CATH superfamily code if predicted, either a C.A.T.H. homologous superfamily or C.A.T. fold assignment. i.e. 3.40.50.300
15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment.
16. cath_assignment_method - Method used to assign a CATH label, either Foldseek or Foldclass
sometimes showed an additional label with a T-level prediction by Foldclass in the case of T-level assignments obtained by Foldseek, e.g.
3.40.30,3.40.30 T foldseek,foldclass
This has now been corrected to reflect the TED protocol, with Foldclass T-level assignments applied only to domains where a T-level assignment could not be applied using Foldseek, e.g.
domain-x 3.40.30 T foldseek
domain-y 3.20.20 T foldclass

Thus, in the current version of the data, a CATH assignment label can only be one of:
H-level assignment by Foldseek (i.e. 3.40.50.300 H foldseek)
T-level assignment by Foldseek (i.e. 3.40.30 T foldseek)
T-level assignment by Foldclass (i.e. 3.40.30 T foldclass)
or no assignment (- - - )
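
The four permitted combinations can be checked mechanically when loading the table. The sketch below assumes the three columns arrive as separate strings, with "-" marking a missing value; the function names are hypothetical and not part of the TED release. `truncate_to_fold` illustrates the changelog's removal of the H digit from T-level predictions.

```python
# (level, method) pairs allowed by the corrected TED protocol.
ALLOWED = {("H", "foldseek"), ("T", "foldseek"), ("T", "foldclass")}


def parse_cath_assignment(label, level, method):
    """Validate one (cath_label, level, method) triple.

    Returns None for the unassigned case ("-", "-", "-"), a dict for a
    valid assignment, and raises ValueError otherwise.
    """
    if (label, level, method) == ("-", "-", "-"):
        return None
    if (level, method) not in ALLOWED:
        raise ValueError(f"unexpected level/method: {level} {method}")
    # H-level labels carry four CATH digits, T-level labels three.
    expected_depth = 4 if level == "H" else 3
    parts = label.split(".")
    if len(parts) != expected_depth or not all(p.isdigit() for p in parts):
        raise ValueError(f"label {label!r} does not match level {level}")
    return {"cath_label": label, "level": level, "method": method}


def truncate_to_fold(cath_code):
    """Drop the H digit from a four-digit code, keeping C.A.T."""
    return ".".join(cath_code.split(".")[:3])
```

For example, `parse_cath_assignment("3.40.30", "T", "foldclass")` is accepted, while a combined value such as `"3.40.30,3.40.30"` from the earlier release would be rejected.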

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID: chain identifier from AFDB in the format AF-

ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
1. TED_ID: TED domain identifier in the format AF-

ted_100_324m_domain_id.list.gz - list of ~324 million domain identifiers in TED100, one per line in the format AF-

ted_redundant_40m_domain_id.list.gz - list of ~40 million domain identifiers in TED redundant, one per line in the format AF-

ted_365m.domain_summary.cath.globularity.taxid.tsv, novel_folds_set.domain_summary.tsv and high_symmetry_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty

1. ted_id - TED domain identifier in the format AF-

ted_324m_seq_clustering.cathlabels.tsv.gz
The file contains the results of the domain sequences clustering with MMseqs2.
Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment if available i.e. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass

Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz

The file ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz contains a header with the
SUPFAM
neuinfo.org
scicrunch.org
+2more
Updated Nov 14, 2024
+ more versions
Cite
(2024). SUPFAM [Dataset]. http://identifiers.org/RRID:SCR_005304
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005304
Dataset updated
Nov 14, 2024
Description
SUPFAM is a database that consists of clusters of potentially related homologous protein domain families, with and without three-dimensional structural information, forming superfamilies. The present release (Release 3.0) of SUPFAM uses homologous families in Pfam (Version 23.0) and SCOP (Release 1.69), which are examples of sequence-alignment and structure classification databases respectively. The two steps involved in setting up the SUPFAM database are:
* Relating Pfam and SCOP families using a new profile-profile alignment algorithm, AlignHUSH. This identifies many Pfam families that could be related to a family or superfamily with known structural information.
* An all-against-all match among Pfam families of as yet unknown structure, resulting in the identification of related Pfam families forming new potential superfamilies.
The SUPFAM database can be used in either Browse mode or Search mode. In Browse mode you can browse through the Superfamilies, Pfam families or SCOP families. In each of these modes you will be presented with a full list which can be easily browsed. In Search mode, you can search for Pfam families, SCOP families or Superfamilies based on keywords or SCOP/Pfam identifiers of families and superfamilies. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
DECIPHER (SILVA_r132) training set for classification
figshare.com
xz
Updated Jun 7, 2020
Cite
Christopher Trivedi (2020). DECIPHER (SILVA_r132) training set for classification [Dataset]. http://doi.org/10.6084/m9.figshare.12443522.v1
Explore at:
Available download formats: xz
Unique identifier
https://doi.org/10.6084/m9.figshare.12443522.v1
Dataset updated
Jun 7, 2020
Dataset provided by
figshare
Authors
Christopher Trivedi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This is a link to the (previous) DECIPHER (http://www2.decipher.codes/Downloads.html) SILVA_r132 training set, since it has been updated to the SILVA_r138 training set on their website. This is for use in an amplicon training workflow as part of the Bioinformatics Virtual Coordination Network (BVCN; https://biovcnet.github.io/). The tutorial in question can be found on the BVCN GitHub: https://github.com/biovcnet/topic-amplicons/tree/master/Lesson03b.
Genome Sizes of Bacterial Species Detected in Cell-Free DNA of Patients with...
data.niaid.nih.gov
zenodo.org
Updated Aug 24, 2024
Cite
Mathur, Arpit; Anam, Karishma; Gawde, Vaibhav; Terse, Vishram; Bhanshe, Prasanna; Joshi, Swapnali; Chaudhary, Shruti; Chatterjee, Gaurav; Rajpal, Sweta; Tembhare, Prashant; Mirgh, Sumeet; Shetty, Alok; Punatar, Sachin; Nayak, Lingaraj; Jain, Hasmukh; Sengar, Manju; Bagal, Bhausaheb; Subramanian, PG; Gujral, Sumeet; Jindal, Nishant; Shetty, Dhanalaxmi; Khattry, Navin; Gokarn, Anant; Patkar, Nikhil (2024). Genome Sizes of Bacterial Species Detected in Cell-Free DNA of Patients with Acute Leukemia and Sepsis, Including Those Undergoing Bone Marrow Transplantation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13356510
Explore at:
Dataset updated
Aug 24, 2024
Dataset provided by
Advanced Centre for Treatment, Research and Education in Cancer
Tata Memorial Hospital
Authors
Mathur, Arpit; Anam, Karishma; Gawde, Vaibhav; Terse, Vishram; Bhanshe, Prasanna; Joshi, Swapnali; Chaudhary, Shruti; Chatterjee, Gaurav; Rajpal, Sweta; Tembhare, Prashant; Mirgh, Sumeet; Shetty, Alok; Punatar, Sachin; Nayak, Lingaraj; Jain, Hasmukh; Sengar, Manju; Bagal, Bhausaheb; Subramanian, PG; Gujral, Sumeet; Jindal, Nishant; Shetty, Dhanalaxmi; Khattry, Navin; Gokarn, Anant; Patkar, Nikhil
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Next Generation Sequencing (NGS) analysis of cell-free DNA provides valuable insights into the spectrum of pathogenic species (particularly bacterial) in blood. Patients with sepsis often face delays in treatment regimens (a combination or cocktail of antibiotics) due to the long turnaround time (TAT) of standard blood culture procedures. NGS gives results with a lower TAT along with high-depth coverage. NGS may therefore help clinicians decide on treatment regimens sooner and more accurately, possibly saving lives.

Our curated dataset lists the bacterial species or strains detected, along with their genome sizes, in 107 AML patients clinically diagnosed with sepsis. Cell-free DNA profiles of patients were built and sequencing was done on Illumina platforms (NovaSeq and NextSeq). Bioinformatic analysis was performed using two classification algorithms, kraken2 and kaiju. For kraken2-based classification, the reference bacterial index developed by Carlo Ferravante et al. (Zenodo 2020) (link: https://zenodo.org/records/4055180) was used, while for kaiju-based classification the reference database named "nr_euk" dated "2023-05-10" (link: https://bioinformatics-centre.github.io/kaiju/downloads.html) was used.

Genome size annotation is important in metagenomics because genome size is required to compute depth of coverage (abundance). Metagenomic classification algorithms like kraken/kraken2 and kaiju report only the number of reads assigned, not abundance. In kaiju, the problem is more complicated since the reference database does not have a fasta file but only an index file from which alignment is done.

To address these challenges and compute "depth of coverage", or simply abundance, we built a Genome Size Annotator tool (https://github.com/patkarlab/Genome-Size-Annotation), which provides the genome size for each detected species, provided its taxid is available. The tool uses the NCBI Datasets tool, the NCBI Genome API check tool, and data mining from AI search engines such as perplexity.ai.
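
As an illustration of why genome size is needed, the sketch below turns a classifier's read count into an approximate depth of coverage using the common approximation depth ≈ (reads assigned × read length) / genome size. This is not taken from the Genome-Size-Annotation tool; the function name, the fixed read length, and the treatment of missing ("NA") genome sizes are all assumptions made for the example.

```python
from typing import Optional


def depth_of_coverage(reads_assigned: int,
                      read_length_bp: int,
                      genome_size_bp: Optional[int]) -> Optional[float]:
    """Approximate depth of coverage for one species.

    Assumption: every assigned read contributes read_length_bp bases,
    so depth ~= reads_assigned * read_length_bp / genome_size_bp.
    Returns None when the genome size is unknown (labeled "NA").
    """
    if genome_size_bp is None or genome_size_bp <= 0:
        return None
    return reads_assigned * read_length_bp / genome_size_bp


# Hypothetical numbers: 12,000 reads of 150 bp on a 4.5 Mb genome.
example_depth = depth_of_coverage(12_000, 150, 4_500_000)
print(example_depth)
```

Two species with the same number of assigned reads can differ several-fold in depth if their genomes differ in size, which is why read counts alone are not a measure of abundance.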

We have curated two datasets:

Kraken2 dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kraken_genome_annotation"
Kaiju dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kaiju_genome_annotation"

*Please note that for the kraken2 curated dataset we used data mining from the AI search engine perplexity.ai, while for kaiju we did not use perplexity.ai, and any species whose genome size was not found was labeled "NA".
Data from: A novel protein motif finding algorithm for classification of the...
bridges.monash.edu
researchdata.edu.au
pdf
Updated Nov 21, 2017
Cite
Sun, Deng-Kuan; Zhang, Tong-Liang; Ding, Yong-Sheng (2017). A novel protein motif finding algorithm for classification of the ligase subfamilies [Dataset]. http://doi.org/10.4225/03/5a1371c69c0e3
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.4225/03/5a1371c69c0e3
Dataset updated
Nov 21, 2017
Dataset provided by
Monash University
Authors
Sun, Deng-Kuan; Zhang, Tong-Liang; Ding, Yong-Sheng
License
http://rightsstatements.org/vocab/InC/1.0/
Description
Extracting motifs from a protein family or subfamily remains an active topic in bioinformatics. It helps in understanding protein function, in predicting the class to which an unknown protein sequence belongs, and in studying protein-protein interactions. In this paper, we present a novel algorithm to extract motifs of a subfamily, based on feature selection and position connection. Position connection is applied to generate motifs, and is combined in a hybrid method with a vote decision-making mechanism to construct a classifier of the ligase subfamilies. In tests on the database, a predictive accuracy of more than 95.87% is achieved. The result demonstrates that this novel method is practical. In addition, the method shows that motifs play an important role in classifying proteins and in studying the characteristics of the subfamilies or families in protein databases. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1

Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Datasets for Lupo et al. (2022) An extended reservoir of class-D...
figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Feb 15, 2022
Cite
Valérian Lupo; Denis BAURAIN; Frédéric Kerff (2022). Datasets for Lupo et al. (2022) An extended reservoir of class-D beta-lactamases in non-clinical bacterial strains [Dataset]. http://doi.org/10.6084/m9.figshare.18544955.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.18544955.v2
Dataset updated
Feb 15, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Valérian Lupo; Denis BAURAIN; Frédéric Kerff
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Lupo et al. 2022: Archive content for v2

Overview...

17 directories, 182 files

README.md: this file.
command-line.sh: examples of bash commands to use or generate the files stored in this archive.
biosample: This directory contains input and output files used to assign a “clinical” score to a BioSample report (…)
bldb_oxa: File in .fasta format of the reference OXA-family sequences from the Beta-lactamase Database (BLDB) used for annotation with the annotate.pl perl script from Bio::MUST modules.
genetic_environment: This directory contains the list of bacterial assembly download links in .csv format to provide to GeneSpy and the list of contig accession numbers to download with the command-line efetch tool from the NCBI E-utilities.
local_refseq_db: The list of the assembly accession numbers of the local RefSeq database built on 7th of December 2017.
ncbi_pathogen: This directory contains consolidated FASTA (.fasta) and TSV (.tab) files downloaded from the NCBI Pathogen Detection server (ftp://ftp.ncbi.nlm.nih.gov/pathogen/): all-prot-nr.fasta, all_bla.tab. It also contains files associated to class-D beta-lactamases (…)
oxa_family: This directory contains the FASTA file bla_d.fasta with the 24,916 OXA-family protein selected with the ompa-pa.pl script and its deduplicated file clst95_bla_d.fasta and also the coordinates file class_d98.bb and the sequence accession identifier file class_d98.idl from ompa-pa.pl.
alignments: Three alignments of OXA-family proteins are available (…)
tree: The mapper.idm is a TSV file that contains the short and corresponding long sequence identifiers used to rename sequences for booster and RAxML tree.
booster: This directory contains raw output files obtained from the booster web server in NEWICK format. boosterweb_tbe_norm.nh is the final tree file.
consense: Consensus tree computed with consense (PHYLIP package) using the 100 replicate trees of RAxML RAxML_bootstrap.classd-final-edit_188-RAXML-PROTGAMMALGF-100xRAPIDBP.
raxml: This directory contains raw output files of RAxML in NEWICK format, computed from the reduced alignment classd-final-edit_188.fasta.
oxa_family_clusters: This directory contains alignment files in FASTA format and the corresponding .hmm profile files for non-singleton clusters (representative sequences) (…)
oxa_family_domains: The 3510 unique OXA-family sequences and their corresponding taxonomy are available in FASTA format 3510_bla.fasta and TSV format 3510_bla.tax (…)
phylogenetic_clustering: This directory contains a templatized R script mcl.script.R.tt used to compute phylogenetic clustering, the ladderized rooted OXA-family tree used by the R script and its associated traits file.
scripts: This directory contains various perl scripts (…)
sql_db: This directory contains the SQL files for the results database (…)
taxdump-20180208: Mirror of the NCBI Taxonomy used in this study (downloaded on 8th of February 2018).
InpactorDB: A Plant classified lineage-level LTR retrotransposon reference...
zenodo.org
zip
Updated Mar 23, 2022
Cite
Orozco-Arias Simon; Jaimes Paula A.; Candamil Mariana; Jiménez-Varón Cristian Felipe; Tabares-Soto Reinel; Isaza Gustavo; Guyot Romain (2022). InpactorDB: A Plant classified lineage-level LTR retrotransposon reference library for free-alignment methods based on Machine Learning [Dataset]. http://doi.org/10.5281/zenodo.4386317
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4386317
Dataset updated
Mar 23, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Orozco-Arias Simon; Jaimes Paula A.; Candamil Mariana; Jiménez-Varón Cristian Felipe; Tabares-Soto Reinel; Isaza Gustavo; Guyot Romain
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LTR retrotransposons are mobile elements that make up the major part of most plant genomes. Their identification and annotation via bioinformatics approaches represent a major challenge in the era of massive plant genome sequencing. In addition to their involvement in genome size variation, these elements are also associated with the function and structure of different chromosomal regions and with the alteration of the function of coding regions, among others. Several plant LTR retrotransposon sequence databases are available, either with public access, such as PGSB and RepetDB, or with restricted access, such as Repbase. Although they are useful for identifying LTR-RTs in new genomes by similarity, the elements of these databases are not classified down to the lineage/family level.

Here, we present InpactorDB, a semi-curated dataset composed of 130,511 elements from 195 plant genomes (belonging to 108 plant species), classified down to the lineage level. This dataset has been used to train two deep neural networks (one fully connected and one convolutional) for fast classification of elements. In lineage-level classification, we obtain F1-score, precision and recall above 98%.

In order to classify elements of the ‘LTR_STRUC’ and ‘EDTA’ datasets, we used the methodology proposed by Inpactor, which uses a homology-based strategy with known coding domains belonging to LTR-RTs. We utilized the RexDB domain library as reference. LTR-RTs were classified into superfamilies, Gypsy (RLG) or Copia (RLC), and sub-classified into lineages according to the similarities of five different amino acid reference domains (GAG, AP, RT, RNAseH, and INT domains). In addition, we applied filters to keep only intact elements:

1) to remove predicted elements with domains from two different superfamilies (i.e. Gypsy and Copia),

2) to remove elements with domains belonging to two or more different lineages,

3) to remove elements with lengths differing by more than 20% from those reported by the Gypsy Database (GyDB),

4) to remove incomplete elements that have fewer than three identified domains, and

5) to remove elements with insertions of class II TEs (reported in Repbase).
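
The five filters above can be sketched as a single predicate. This is a minimal illustration, not the InpactorDB pipeline: the `Domain` and `Element` records, the `reference_length` field (standing in for the GyDB length of the element's lineage), and the `has_class2_insertion` flag are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Domain:
    name: str         # e.g. "GAG", "AP", "RT", "RNAseH", "INT"
    superfamily: str  # "RLG" (Gypsy) or "RLC" (Copia)
    lineage: str      # lineage assigned from the domain hit


@dataclass
class Element:
    length: int                 # predicted element length in bp
    reference_length: int       # hypothetical: GyDB length for the lineage
    domains: List[Domain]
    has_class2_insertion: bool = False  # hypothetical flag from a Repbase scan


def is_intact(element: Element, tolerance: float = 0.20,
              min_domains: int = 3) -> bool:
    """Return True if the element passes all five filters."""
    # 1) domains must come from a single superfamily
    if len({d.superfamily for d in element.domains}) > 1:
        return False
    # 2) domains must belong to a single lineage
    if len({d.lineage for d in element.domains}) > 1:
        return False
    # 3) length must be within the tolerance of the reference length
    allowed = tolerance * element.reference_length
    if abs(element.length - element.reference_length) > allowed:
        return False
    # 4) at least min_domains distinct domains must be identified
    if len({d.name for d in element.domains}) < min_domains:
        return False
    # 5) no class II TE insertion
    if element.has_class2_insertion:
        return False
    return True


copia_domains = [Domain(n, "RLC", "ALE") for n in ("GAG", "RT", "INT")]
candidate = Element(length=5000, reference_length=5200, domains=copia_domains)
print(is_intact(candidate))
```

Counting distinct domain names in step 4 is one reading of "fewer than three identified domains"; counting raw hits would be an equally plausible choice.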

The final non-redundant version of InpactorDB consists of 67,305 LTR retrotransposons. Both redundant and non-redundant versions of InpactorDB are available in FASTA format, in which sequence identifiers follow this general pattern:

>Superfamily-Lineage-plant_family-specie-source-length-ID

where Superfamily is either RLC (for Copia) or RLG (for Gypsy); Lineage/family follows the RexDB nomenclature; source can be Repbase, RepetDB, PGSB, LTR_STRUC or EDTA; length is the length of the element; and ID is a unique number that identifies each element within InpactorDB.
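
A header in this pattern can be split into its seven fields as sketched below. The parser assumes that no field itself contains a hyphen, which the description does not guarantee, and the example header is made up for illustration rather than copied from InpactorDB.

```python
from typing import NamedTuple


class InpactorHeader(NamedTuple):
    superfamily: str   # "RLC" (Copia) or "RLG" (Gypsy)
    lineage: str
    plant_family: str
    species: str
    source: str
    length: int
    element_id: str


def parse_header(line: str) -> InpactorHeader:
    """Split '>Superfamily-Lineage-plant_family-specie-source-length-ID'.

    Assumption: fields never contain '-', so exactly seven parts result.
    """
    fields = line.strip().lstrip(">").split("-")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    superfamily, lineage, plant_family, species, source, length, element_id = fields
    if superfamily not in ("RLC", "RLG"):
        raise ValueError(f"unknown superfamily: {superfamily!r}")
    return InpactorHeader(superfamily, lineage, plant_family, species,
                          source, int(length), element_id)


# Hypothetical header, for illustration only.
header = parse_header(">RLC-ALE-Rubiaceae-Coffea_canephora-EDTA-4821-17")
print(header.superfamily, header.lineage, header.length)
```

If real identifiers turn out to contain hyphens inside a field, the fixed-count check fails loudly instead of silently misassigning fields.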
The hybrid database of VITAP and related taxonomic assignments of IMG/VR...
figshare.com
zip
Updated Oct 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaiyang Zheng (2024). The hybrid database of VITAP and related taxonomic assignments of IMG/VR (v.4) vOTUs [Dataset]. http://doi.org/10.6084/m9.figshare.25426159.v3
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.25426159.v3
Dataset updated
Oct 18, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Kaiyang Zheng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The hybrid database of VITAP and related taxonomic assignments of IMG/VR (v.4) vOTUs (https://github.com/DrKaiyangZheng/VITAP).
Removing contaminants from databases of draft genomes
plos.figshare.com
xlsx
Updated Jun 3, 2023
Cite
Jennifer Lu; Steven L. Salzberg (2023). Removing contaminants from databases of draft genomes [Dataset]. http://doi.org/10.1371/journal.pcbi.1006277
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pcbi.1006277
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Jennifer Lu; Steven L. Salzberg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.
Human Disease Ontology 2018 update: classification, content and workflow...
data.niaid.nih.gov
zenodo.org
Updated Jun 29, 2023
Cite
Quentin St.Charles (2023). Human Disease Ontology 2018 update: classification, content and workflow expansion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8083644
Explore at:
Dataset updated
Jun 29, 2023
Authors
Quentin St.Charles
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
ABSTRACT:

The Human Disease Ontology (DO) (http://www.disease-ontology.org) database has undergone significant expansion in the past three years. The DO disease classification includes specific formal semantic rules to express meaningful disease models and has expanded from a single asserted classification to include multiple-inferred mechanistic disease classifications, thus providing novel perspectives on related diseases. Expansion of disease terms, alternative anatomy, cell type and genetic disease classifications and workflow automation highlight the updates for the DO since 2015. The enhanced breadth and depth of the DO's knowledgebase has expanded the DO's utility for exploring the multi-etiology of human disease, thus improving the capture and communication of health-related data across biomedical databases, bioinformatics tools, genomic and cancer resources, as demonstrated by a 6.6× growth in the DO's user community since 2015. The DO's continual integration of human disease knowledge, evidenced by the more than 200 SVN/GitHub releases/revisions since previously reported in our DO 2015 NAR paper, includes the addition of 2650 new disease terms, a 30% increase of textual definitions, and an expanding suite of disease classification hierarchies constructed through defined logical axioms.

Instructions:

Data was cleaned. Duplicates and unnecessary columns were removed. Column titles were changed.

Inspiration:

This dataset was uploaded to U-BRITE for the "DRG_DEPOT" summer 2023 team project.

Acknowledgements:

Schriml, L. M., Mitraka, E., Munro, J., Tauber, B., Schor, M., Nickle, L., Felix, V., Jeng, L., Bearer, C., Lichenstein, R., Bisordi, K., Campion, N., Hyman, B., Kurland, D., Oates, C. P., Kibbey, S., Sreekumar, P., Le, C., Giglio, M., & Greene, C.

Human Disease Ontology 2018 update: classification, content and workflow expansion

Nucleic Acids Research 2019; 47(D1), D955–D962;PMID:30407550;DOI:https://doi.org/10.1093/nar/gky1032

U-BRITE last update date: 06/28/2023
