CATH Domain Classification List (latest release): protein structural domains classified into the CATH hierarchy.
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive analysis of alternative splicing and splicing variants. For human alternative splicing data, newly added EST libraries were classified and incorporated into the previous tissue and cancer classification, and the lists of tissue-specific and cancer (vs. normal) specific alternatively spliced genes were recalculated and updated. The authors created novel databases of orthologous exons and introns and their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than methods based on protein similarity. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II can be searched by several criteria such as gene symbol, gene name and ID (UniGene, GenBank etc.). The web interface provides 7 kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships, with supporting evidence, types of alternative splicing patterns, and inclusion rates for skipped exons, are listed in separate tables.
Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. The p-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
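As an illustration of the kind of hydrophobicity calculation the Kyte-Doolittle scale supports, here is a minimal sketch that averages the published per-residue hydropathy values over a sequence (often called GRAVY). The function name `gravy` is hypothetical, and the dataset description does not say how its own properties were computed, so this is not necessarily the method used to build the CSV files.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        # Non-standard residues (X, B, Z, ...) are rejected rather than guessed.
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the dataset.
    print(round(gravy("MKTAYIAKQR"), 3))
```

Positive values indicate a more hydrophobic sequence on this scale, negative values a more hydrophilic one.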
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current database was downloaded on 27.09.2019 and has the data fields (columns) described below:
# 1 Entry
# 2 Entry name
# 3 Status
# 4 Protein names
# 5 Gene names
# 6 Organism
# 7 Length
# 8 Cross-reference (KO)
# 9 Taxonomic lineage (PHYLUM)
# 10 Taxonomic lineage (SPECIES) # This field carries current and old* taxonomic classifications.
# 11 Taxonomic lineage (GENUS)
# 12 Taxonomic lineage (KINGDOM)
# 13 Taxonomic lineage (SUPERKINGDOM)
# 14 Cross-reference (OrthoDB)
# 15 Cross-reference (eggNOG)
*Details about the classification used in UNIPROT can be found at the link: https://www.uniprot.org/help/taxonomy
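A table with these fifteen columns can be read with the Python standard library. The sketch below assumes the file is tab-separated with a header row that uses exactly the column names listed above; the description does not state the file format, so treat that as an assumption. The helper names and the placeholder rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Column names as listed in the description (assumed to match the header row).
COLUMNS = [
    "Entry", "Entry name", "Status", "Protein names", "Gene names",
    "Organism", "Length", "Cross-reference (KO)",
    "Taxonomic lineage (PHYLUM)", "Taxonomic lineage (SPECIES)",
    "Taxonomic lineage (GENUS)", "Taxonomic lineage (KINGDOM)",
    "Taxonomic lineage (SUPERKINGDOM)", "Cross-reference (OrthoDB)",
    "Cross-reference (eggNOG)",
]


def count_by_phylum(handle):
    """Count entries per phylum in a tab-separated table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    phylum = "Taxonomic lineage (PHYLUM)"
    return Counter(row[phylum] or "unassigned" for row in reader)


def placeholder_row(entry, phylum):
    """Build a made-up row; every other field is left empty."""
    values = {name: "" for name in COLUMNS}
    values["Entry"] = entry
    values["Taxonomic lineage (PHYLUM)"] = phylum
    return "\t".join(values[name] for name in COLUMNS)


# Placeholder data standing in for the real download.
sample = "\n".join([
    "\t".join(COLUMNS),
    placeholder_row("ENTRY_1", "Phylum_A"),
    placeholder_row("ENTRY_2", "Phylum_A"),
    placeholder_row("ENTRY_3", ""),
]) + "\n"

counts = count_by_phylum(io.StringIO(sample))
print(counts.most_common())
```

For the real file, pass an open file handle (opened with `newline=""`) instead of the in-memory sample.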
Bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. It consists of a comprehensive knowledgebase and a set of functional analysis tools, and includes a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Databases used for MyCodentifier, a Nextflow pipeline to identify Mycobacterium tuberculosis complex (MTBC) and nontuberculous mycobacteria (NTM) species from next-generation sequencing (NGS) data.
Short description:
The pipeline is built with Nextflow as the workflow manager and runs in a Docker container. It identifies MTBC/NTM species from positive Mycobacterial Growth Indicator Tube (MGIT) cultures. To do so, it uses an hsp65 database for fast identification, coupled with a metagenomic method using Centrifuge for identification at the genome level. For TB it can also identify subspecies. Results are presented in automated PDF and HTML reports.
| Name | Short Description |
| 20220726_ref.tar.gz | 7 major mycobacterial genomes as centrifuge classification database, used for reference-based mapping and genotype resistance prediction |
| 20220726_wgs_centrifuge_db_Radboudumc_MB.tar.gz | centrifuge classification database using Tortoli et al 2017 Mycobacterium strains + additional strains |
| genomes.tar.gz | 7 major mycobacterial genomes, annotation and Genbank files. Files are paired with 20220726_ref.tar.gz |
| snpEff.tar.gz | 7 major mycobacterial genomes annotation models for snpEff. |
| Tortoli_etal_hsp65.tar.gz | KMA database of hsp65 gene extractions of the Tortoli et al 2017 Mycobacterium strains. |
Used in the study:
Databases available via ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data or https://ccb.jhu.edu/software/centrifuge/manual.shtml#custom-database
MyCodentifier Github:
https://jordycoolen.github.io/MyCodentifier/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an UNOFFICIAL host for the GTDB mash sketch based on GTDB r220
Intended use of this file is to include in the VEBA database for quicker GTDB-Tk analysis.
Created by running the following command using GTDB-Tk v2.4.0 on the S1 sample from Zenodo:7946802:
gtdbtk classify_wf --genome_dir veba_output/binning/prokaryotic/S1/output/genomes/ --out_dir test_output -x fa --cpus 1 --mash_db ./gtdb_r220.msh
Source Files:
RELEASE_NOTES.txt
Release 220.0:
--------------
GTDB release R09-RS220 comprises 596,859 genomes organised into 113,104 species clusters. Additional statistics for this release are available on the GTDB Statistics page.
Release notes:
--------------
- Average nucleotide identity (ANI) between genomes is now calculated using skani (Shaw et al., Nat Methods, 2023) instead of FastANI (Jain et al, Nat Commun, 2018). skani provides a substantial reduction in computational requirements while producing similar ANI values and more accurate alignment fraction (AF) values.
- CheckM v2 information is included on the website and in the metadata files, noting at this stage that these data were not used for the QC step in release 220.
- Post-curation cycle, we identified updated spelling for 15 taxon names:
  p_Calescibacterota (updated name: Calescibacteriota)
  c_Brachyspirae (updated name: Brachyspiria)
  c_Leptospirae (updated name: Leptospiria)
  o_Ammonifexales (updated name: Ammonificales)
  o_Exiguobacterales (updated name: Exiguobacteriales)
  o_Hydrogenedentiales (updated name: Hydrogenedentales)
  o_Phormidesmiales (updated name: Phormidesmidales)
  f_Arcanobacteraceae (updated name: Arcanibacteraceae)
  f_Acetonemaceae (updated name: Acetonemataceae)
  f_Ethanoligenenaceae (updated name: Ethanoligenentaceae)
  f_Exiguobacteraceae (updated name: Exiguobacteriaceae)
  f_Geitlerinemaceae (updated name: Geitlerinemataceae)
  f_Koribacteraceae (updated name: Korobacteraceae)
  f_Phormidesmiaceae (updated name: Phormidesmidaceae)
  f_Porisulfidaceae (updated name: Poriferisulfidaceae)
  Note that the LPSN linkouts point to the correct updated names. We encourage users to use the updated names as these will appear in the next release.
- Post-curation cycle, we discovered that two provisionally named families, Nitrincolaceae and Denitrovibrionaceae have been validly named under the ICNP as Balneatricaceae and Geovibrionaceae, respectively. We encourage users to use the validly published names as these will appear in the next release.
- We thank Jan Mares for his assistance in curating the class Cyanobacteriia and Brian Kemish for providing IT support to the project.
If you have found this useful, please cite the original publications:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of (compressed) index database files suitable for use with Centrifuge, Kraken1 and Kraken2 that can be used to classify metagenomes using the GTDB_r89_54k index. More information and details at: https://github.com/rrwick/Metagenomics-Index-Correction
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Timely classification and identification of bacteria is of vital importance in many areas of public health. We present a mass spectrometry (MS)-based proteomics approach for bacterial classification. In this method, a bacterial proteome database is derived from all potential protein coding open reading frames (ORFs) found in 170 fully sequenced bacterial genomes. Amino acid sequences of tryptic peptides obtained by LC−ESI MS/MS analysis of the digest of bacterial cell extracts are assigned to individual bacterial proteomes in the database. Phylogenetic profiles of these peptides are used to create a matrix of sequence-to-bacterium assignments. These matrixes, viewed as specific assignment bitmaps, are analyzed using statistical tools to reveal the relatedness between a test bacterial sample and the microorganism database. It is shown that, if a sufficient amount of sequence information is obtained from the MS/MS experiments, a bacterial sample can be classified to a strain level by using this proteomics method, leading to its positive identification. Keywords: classification of bacteria • proteomics • tandem mass spectrometry • LC−MS/MS • bioinformatics
https://www.gnu.org/licenses/agpl.txt
This is an UNOFFICIAL host for the GTDB mash sketch based on GTDB r214.1
Intended use of this file is to include in the VEBA database for quicker GTDB-Tk analysis.
Created by running the following command using GTDB-Tk v2.3.0 on the S1 sample from Zenodo:7946802:
gtdbtk classify_wf --genome_dir veba_output/binning/prokaryotic/S1/output/genomes/ --out_dir test_output -x fa --cpus 1 --mash_db ./gtdb_r214.msh
Source Files:
gtdbtk_r214_data.tar.gz
RELEASE_NOTES.txt
Release Notes:
Correction regarding the classification of the genome "GB_GCA_902406375.1" in 214.1 release. We have identified an error in the taxonomy assignment for this particular genome.
The genome GB_GCA_902406375.1 was previously classified as Collinsella sp905215505 in some files. We have reevaluated the taxonomy and determined that the correct classification should be Collinsella sp002232035. We have rectified this error and made the necessary updates to the following files within the package:
- bac120_taxonomy_r214.tsv
- sp_clusters_r214.tsv
- ssu_all_r214.tar.gz
We thank Jan Mareš for his help in curating the Cyanobacteria.
Phylum names have been updated following the valid publication of 42 names in IJSEM (https://pubmed.ncbi.nlm.nih.gov/34694987/), including Bacillota and Pseudomonadota
Fixed issue with SSU files where sequences started 2 bp after correct start and stopped 1 bp after correct end of sequence. Thanks to CX for bringing this issue to our attention: https://forum.gtdb.ecogenomic.org/t/16s-23s-and-ssu-all-r207/307/2
SSU files now provide sequences in their 5' to 3' orientation
Changed QC criterion for number of contigs from 1000 to 2000 in order to better align the GTDB criteria with RefSeq (https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/)
Changed QC criterion to use ar53 instead of ar122 marker set. The impact of this change was evaluated on the 353,569 genomes (~6,100 archaeal) considered for GTDB R207:
-- only 1 additional genome passed QC
-- only 21 additional genomes failed QC which included the following species representatives:
   -- s_Methanoregula sp002497485
   -- s_Methanobrevibacter_A sp017634055
   -- s_Methanosphaera sp003266165
   -- s_MGIIa-L1 sp002688825
   -- s_MGIIb-N2 sp002503665
   -- s_MGIIa-L2 sp002692685
   -- s_MGIIb-O3 sp002730445
   -- s_DTDI01 sp011334935
   -- s_Methanosphaera sp017652595
   -- s_Nitrosopelagicus sp902606945
   -- s_Methanolinea sp002501965
If you have found this useful, please cite the original publications:
Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. Bioinformatics, btac672.
Parks, D.H., et al. (2021). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50: D785–D794.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.
In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH SuperFamily or Fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).
For all chains in the chain-level TED-redundant files, the file contains boundary predictions, consensus level and information on the TED100 representative.
For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).
We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as 6,433 representatives of highly symmetrical folds.
Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.
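Where the command-line tools are not available, the same two extraction steps can be done with the Python standard library. This is a minimal sketch; the function name `extract` and the file names in the usage note are hypothetical and not part of the TED release.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract(archive, dest):
    """Extract a .tar.gz archive or a single .gz file into the directory dest."""
    archive = Path(archive)
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    if archive.name.endswith((".tar.gz", ".tgz")):
        # Equivalent of "tar -xzvf file.tar.gz"; only use on trusted archives.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        return dest

    if archive.suffix == ".gz":
        # Equivalent of gunzip, except that the compressed file is kept.
        target = dest / archive.stem
        with gzip.open(archive, "rb") as src, open(target, "wb") as out:
            shutil.copyfileobj(src, out)
        return target

    raise ValueError(f"not a .gz or .tar.gz file: {archive}")
```

For example, `extract("some_table.tsv.gz", "ted_data")` would write `ted_data/some_table.tsv`, streaming the data so that large files do not need to fit in memory.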
CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.
SUPFAM is a database that consists of clusters of potentially related homologous protein domain families, with and without three-dimensional structural information, forming superfamilies. The present release (Release 3.0) of SUPFAM uses homologous families in Pfam (Version 23.0) and SCOP (Release 1.69), which are examples of sequence-alignment and structure classification databases respectively. The two steps involved in setting up the SUPFAM database are:
* Relating Pfam and SCOP families using a new profile-profile alignment algorithm, AlignHUSH. This identifies many Pfam families that could be related to a family or superfamily with known structural information.
* An all-against-all match among Pfam families with as yet unknown structure, which identifies related Pfam families forming new potential superfamilies.
The SUPFAM database can be used in either Browse mode or Search mode. In Browse mode you can browse through the Superfamilies, Pfam families or SCOP families; each of these is presented as a full list. In Search mode, you can search for Pfam families, SCOP families or Superfamilies based on keywords or SCOP/Pfam identifiers of families and superfamilies. THIS RESOURCE IS NO LONGER IN SERVICE. Documented on September 16, 2025.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a link to the (previous) DECIPHER (http://www2.decipher.codes/Downloads.html) SILVA_r132 training set, since it has been updated to the SILVA_r138 training set on their website. This is for use in an amplicon training workflow as part of the Bioinformatics Virtual Coordination Network (BVCN; https://biovcnet.github.io/). The tutorial in question can be found on the BVCN github - https://github.com/biovcnet/topic-amplicons/tree/master/Lesson03b.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Next Generation Sequencing (NGS) analysis of cell-free DNA provides valuable insights into the spectrum of pathogenic species (particularly bacterial) in blood. Patients with sepsis often face delays in treatment regimens (a combination or cocktail of antibiotics) because of the long turnaround time (TAT) of classical and standard blood culture procedures. NGS gives results with a lower TAT along with high depth of coverage. NGS may therefore help in deciding treatment regimens for patients more accurately and without losing precious time, possibly saving lives.
Our curated dataset lists the bacterial species or strains detected, along with their genome sizes, in 107 AML patients clinically diagnosed with sepsis. Cell-free DNA profiles of patients were built and sequencing was done on Illumina platforms (NovaSeq and NextSeq). Bioinformatic analysis was performed using two classification algorithms, kraken2 and kaiju. For kraken2-based classification, the reference bacterial index developed by Carlo Ferravante et al. (Zenodo 2020) (link: https://zenodo.org/records/4055180) was used, while for kaiju-based classification the reference database named "nr_euk" dated "2023-05-10" (link: https://bioinformatics-centre.github.io/kaiju/downloads.html) was used.
Genome size annotation is important in metagenomics because genome size is required to compute depth of coverage (abundance). Metagenomic classification algorithms like kraken/kraken2 and kaiju report only the reads assigned, not abundance. In kaiju, the problem is more complicated since the reference database does not have a FASTA file but only an index file against which alignment is done.
To address these challenges and compute "depth of coverage", or simply abundance, we built a genome size annotator tool (https://github.com/patkarlab/Genome-Size-Annotation) which provides the genome size for each detected species, provided its taxid is available. The tool uses the NCBI Datasets tool, the NCBI Genome API check tool, and data mining from AI search engines like perplexity.ai.
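To make the role of genome size concrete, the sketch below uses the common approximation that depth of coverage equals assigned reads times read length divided by genome size. The description does not give the exact formula used in the study, so this formula, the function name and the example numbers are assumptions for illustration only.

```python
def depth_of_coverage(reads_assigned, read_length_bp, genome_size_bp):
    """Approximate depth of coverage for one species.

    Returns None when the genome size is unknown (labeled "NA" in the
    curated tables), since coverage cannot be computed without it.
    """
    if genome_size_bp is None:
        return None
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    # Total sequenced bases assigned to the species, spread over its genome.
    return reads_assigned * read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Hypothetical values: 10,000 reads of 150 bp against a 5 Mb genome.
    print(depth_of_coverage(10_000, 150, 5_000_000))
    # Unknown genome size: no coverage can be reported.
    print(depth_of_coverage(10_000, 150, None))
```

Two species with the same read count can differ widely in coverage if their genome sizes differ, which is why read counts alone are a poor proxy for abundance.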
We have curated two datasets:
Kraken2 dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kraken_genome_annotation"
Kaiju dataset named "FINAL METAGENOMIC DATA MASTERSHEET - kaiju_genome_annotation"
*Please note that for the kraken2 curated dataset we used data mining from the AI search engine perplexity.ai, while for kaiju we did not use perplexity.ai, and any species whose genome size was not found was labeled "NA".
http://rightsstatements.org/vocab/InC/1.0/
Extracting motifs from a protein family or subfamily remains an active research topic in bioinformatics. It not only helps to understand protein functions and to predict the class to which an unknown protein sequence belongs, but also helps to study protein-protein interactions. In this paper, we present a novel algorithm to extract motifs of a subfamily, which is based on feature selection and position connection. Position connection is applied to generate motifs, and a hybrid method with a vote-based decision mechanism is used to construct the classifier of the ligase subfamilies. In tests on the database, a predictive accuracy of more than 95.87% is achieved. The result demonstrates that this novel method is practical. In addition, the method shows that motifs play an important role in classifying proteins and in studying the characteristics of the subfamilies or families of a protein database. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lupo et al. 2022: Archive content for v2
Overview...
17 directories, 182 files
README.md: this file.
command-line.sh: examples of bash commands to use or generate the files stored in this archive.
biosample: This directory contains input and output files used to assign a “clinical” score to a BioSample report (…)
bldb_oxa: File in .fasta format of the reference OXA-family sequences from the Beta-lactamase Database (BLDB) used for annotation with the annotate.pl perl script from Bio::MUST modules.
genetic_environment: This directory contains the list of bacterial assembly download links in .csv format to provide to GeneSpy and the list of contig accession numbers to download with the command-line efetch tool from the NCBI E-utilities.
local_refseq_db: The list of the assembly accession numbers of the local RefSeq database built on 7th of December 2017.
ncbi_pathogen: This directory contains consolidated FASTA (.fasta) and TSV (.tab) files downloaded from the NCBI Pathogen Detection server (ftp://ftp.ncbi.nlm.nih.gov/pathogen/): all-prot-nr.fasta, all_bla.tab. It also contains files associated to class-D beta-lactamases (…)
oxa_family: This directory contains the FASTA file bla_d.fasta with the 24,916 OXA-family protein selected with the ompa-pa.pl script and its deduplicated file clst95_bla_d.fasta and also the coordinates file class_d98.bb and the sequence accession identifier file class_d98.idl from ompa-pa.pl.
alignments: Three alignments of OXA-family proteins are available (…)
tree: The mapper.idm is a TSV file that contains the short and corresponding long sequence identifiers used to rename sequences for booster and RAxML tree.
booster: This directory contains raw output files obtained from the booster web server in NEWICK format. boosterweb_tbe_norm.nh is the final tree file.
consense: Consensus tree computed with consense (PHYLIP package) using the 100 replicate trees of RAxML RAxML_bootstrap.classd-final-edit_188-RAXML-PROTGAMMALGF-100xRAPIDBP.
raxml: This directory contains raw output files of RAxML in NEWICK format, computed from the reduced alignment classd-final-edit_188.fasta.
oxa_family_clusters: This directory contains alignment files in FASTA format and the corresponding .hmm profile files for non-singleton clusters (representative sequences) (…)
oxa_family_domains: The 3510 unique OXA-family sequences and their corresponding taxonomy are available in FASTA format 3510_bla.fasta and TSV format 3510_bla.tax (…)
phylogenetic_clustering: This directory contains a templatized R script mcl.script.R.tt used to compute phylogenetic clustering, the ladderized rooted OXA-family tree used by the R script and its associated traits file.
scripts: This directory contains various perl scripts (…)
sql_db: This directory contains the SQL files for the results database (…)
taxdump-20180208: Mirror of the NCBI Taxonomy used in this study (downloaded on 8th of February 2018).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LTR retrotransposons are mobile elements that make up the major part of most plant genomes. Their identification and annotation via bioinformatics approaches represent a major challenge in the era of massive plant genome sequencing. In addition to their involvement in the variation in genome size, these elements are also involved in the function and structure of different chromosomal regions and in the alteration of the function of coding regions, among others. Several plant LTR retrotransposon sequence databases are available, either with public access, such as PGSB and RepetDB, or with restricted access, such as Repbase. Although they are useful for identifying LTR-RTs in new genomes by similarity, the elements of these databases are not classified in depth down to the lineage/family level.
Here, we present InpactorDB, a semi-curated dataset composed of 130,511 elements from 195 plant genomes (belonging to 108 plant species), classified down to the lineage level. This dataset has been used to train two deep neural networks (one fully connected and one convolutional) for fast classification of elements. Used in lineage-level classification approaches, these networks achieve F1-score, precision and recall above 98%.
In order to classify elements of the ‘LTR_STRUC’ and ‘EDTA’ datasets, we used the methodology proposed by Inpactor, which uses a homology-based strategy with known coding domains belonging to LTR-RTs. We utilized the RexDB domain library as reference. LTR-RTs were classified into superfamilies, Gypsy (RLG) or Copia (RLC), and sub-classified into lineages according to the similarities of five different amino acid reference domains (GAG, AP, RT, RNAseH, and INT domains). In addition, we applied filters to keep only intact elements:
1) to remove predicted elements with domains from two different superfamilies (i.e. Gypsy and Copia),
2) or elements with domains belonging to two or more different lineages,
3) to remove elements with lengths different than those reported by Gypsy Database (GyDB) with a tolerance of 20%,
4) to delete incomplete elements that have fewer than three identified domains, and
5) to remove elements with insertions of TE class II (reported in Repbase).
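The five filters above can be sketched as a single predicate over a predicted element. The data classes, field names and the per-lineage `reference_length` argument are hypothetical; in particular, the 20% tolerance is interpreted here as a relative deviation from a GyDB reference length, which is an assumption about how the authors applied it.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str         # e.g. GAG, AP, RT, RNAseH, INT
    superfamily: str  # "RLC" (Copia) or "RLG" (Gypsy)
    lineage: str


@dataclass
class Element:
    element_id: str
    length: int
    domains: list = field(default_factory=list)
    has_class2_insertion: bool = False


def is_intact(element, reference_length, tolerance=0.20, min_domains=3):
    """Return True if the element passes all five filters."""
    superfamilies = {d.superfamily for d in element.domains}
    lineages = {d.lineage for d in element.domains}
    if len(superfamilies) > 1:          # filter 1: mixed superfamilies
        return False
    if len(lineages) > 1:               # filter 2: mixed lineages
        return False
    if abs(element.length - reference_length) > tolerance * reference_length:
        return False                    # filter 3: length outside tolerance
    if len(element.domains) < min_domains:
        return False                    # filter 4: incomplete element
    if element.has_class2_insertion:    # filter 5: class II TE insertion
        return False
    return True


if __name__ == "__main__":
    # Made-up element with three consistent domains.
    domains = [Domain(n, "RLC", "ALE") for n in ("GAG", "RT", "INT")]
    candidate = Element("demo_1", 4900, domains)
    print(is_intact(candidate, reference_length=5000))
```

Keeping each filter as a separate early return makes it easy to count how many elements each rule removes.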
The final non-redundant version of InpactorDB consists of 67,305 LTR retrotransposons. Both redundant and non-redundant versions of InpactorDB are available in Fasta format in which sequences have identifiers with the following general Identification code:
>Superfamily-Lineage-plant_family-specie-source-length-ID,
where Superfamily is either RLC (for Copia) or RLG (for Gypsy), Lineage follows the RexDB nomenclature, and source can be Repbase, RepetDB, PGSB, LTR_STRUC or EDTA, followed by length and ID, a unique number that identifies each element inside InpactorDB.
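A header in this format can be split into its seven fields as sketched below. The sketch assumes that no field itself contains a hyphen and that length is an integer; neither is guaranteed by the description, so headers that do not split into exactly seven parts are rejected. The example header and all names in the code are hypothetical.

```python
from typing import NamedTuple


class InpactorHeader(NamedTuple):
    superfamily: str
    lineage: str
    plant_family: str
    species: str
    source: str
    length: int
    element_id: str


def parse_header(line: str) -> InpactorHeader:
    """Parse '>Superfamily-Lineage-plant_family-specie-source-length-ID'."""
    fields = line.strip().lstrip(">").split("-")
    if len(fields) != 7:
        # A hyphen inside a field would break the naive split.
        raise ValueError(f"expected 7 hyphen-separated fields, got {len(fields)}")
    superfamily, lineage, family, species, source, length, element_id = fields
    if superfamily not in {"RLC", "RLG"}:
        raise ValueError(f"unknown superfamily: {superfamily}")
    return InpactorHeader(superfamily, lineage, family, species,
                          source, int(length), element_id)


if __name__ == "__main__":
    # Made-up header for illustration, not copied from InpactorDB.
    print(parse_header(">RLC-ALE-Poaceae-Oryza_sativa-EDTA-4850-17"))
```

Parsed headers can then be grouped by `superfamily` or `lineage` to build labels for a classifier.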
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The hybrid database of VITAP and related taxonomic assignments of IMG/VR (v.4) vOTUs (https://github.com/DrKaiyangZheng/VITAP).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
ABSTRACT:
The Human Disease Ontology (DO) (http://www.disease-ontology.org) database has undergone significant expansion in the past three years. The DO disease classification includes specific formal semantic rules to express meaningful disease models and has expanded from a single asserted classification to include multiple-inferred mechanistic disease classifications, thus providing novel perspectives on related diseases. Expansion of disease terms, alternative anatomy, cell type and genetic disease classifications and workflow automation highlight the updates for the DO since 2015. The enhanced breadth and depth of the DO's knowledgebase has expanded the DO's utility for exploring the multi-etiology of human disease, thus improving the capture and communication of health-related data across biomedical databases, bioinformatics tools, genomic and cancer resources, as demonstrated by a 6.6× growth in DO's user community since 2015. The DO's continual integration of human disease knowledge, evidenced by the more than 200 SVN/GitHub releases/revisions since previously reported in our DO 2015 NAR paper, includes the addition of 2650 new disease terms, a 30% increase of textual definitions, and an expanding suite of disease classification hierarchies constructed through defined logical axioms.
Instructions:
Data was cleaned. Duplicates and unnecessary columns were removed. Column titles were changed.
Inspiration:
This dataset was uploaded to U-BRITE for the "DRG_DEPOT" summer 2023 team project.
Acknowledgements:
Schriml, L. M., Mitraka, E., Munro, J., Tauber, B., Schor, M., Nickle, L., Felix, V., Jeng, L., Bearer, C., Lichenstein, R., Bisordi, K., Campion, N., Hyman, B., Kurland, D., Oates, C. P., Kibbey, S., Sreekumar, P., Le, C., Giglio, M., & Greene, C.
Human Disease Ontology 2018 update: classification, content and workflow expansion
Nucleic Acids Research 2019; 47(D1), D955–D962;PMID:30407550;DOI:https://doi.org/10.1093/nar/gky1032
U-BRITE last update date: 06/28/2023