THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways; the help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project called e-Protein, which brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "to provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies". The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated, each as the protein sequences from the respective genome: Anopheles gambiae (mosquito); Arabidopsis thaliana; Caenorhabditis briggsae; Caenorhabditis elegans (worm); Ciona intestinalis (sea squirt); Danio rerio (zebrafish); Drosophila melanogaster (fruit fly); Encephalitozoon cuniculi; Fugu rubripes (pufferfish); Guillardia theta; Homo sapiens (human); Mus musculus (mouse); Neurospora crassa; Oryza sativa (rice); Plasmodium falciparum; Rattus norvegicus (rat); Saccharomyces cerevisiae (budding yeast); Schizosaccharomyces pombe (fission yeast).
Structural features of genomes, including the three-dimensional arrangement of DNA in the nucleus, are increasingly seen as key contributors to the regulation of gene expression. However, studies on how genome structure and nuclear organisation influence transcription have so far been limited to a handful of model species. This narrow focus limits our ability to draw general conclusions about the ways in which three-dimensional structures are encoded, and to integrate information from three-dimensional data to address a broader gamut of biological questions. Here, we generate a complete and gapless genome sequence for the filamentous fungus, Epichloë festucae. We use Hi-C data to examine the three-dimensional organisation of the genome, and RNA-seq data to investigate how Epichloë genome structure contributes to the suite of transcriptional changes needed to maintain symbiotic relationships with the grass host. Our results reveal a genome in which very repeat-rich blocks of DNA with discrete boundaries are interspersed by gene-rich sequences that are almost repeat-free. In contrast to other species reported to date, the three-dimensional structure of the genome is anchored by these repeat blocks, which act to isolate transcription in neighbouring gene-rich regions. Genes that are differentially expressed in planta are enriched near the boundaries of these repeat-rich blocks, suggesting that their three-dimensional orientation partly encodes and regulates the symbiotic relationship formed by this organism.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for the paper: Machine learning reveals the diversity of human 3D chromatin contact patterns
GitHub: https://github.com/erin-n-gilbertson/3DGenome-diversity/tree/main
bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.22.573104v1.full
Manuscript accepted at Molecular Biology and Evolution
Of primary interest are the example genome-wide predictions for the hg38 reference, the human-archaic hominin ancestor, and the most divergent 1KG individual per genome, along with the Jupyter notebook tutorial for making your own Akita predictions from any 1 Mb input sequence.
We analyzed conservation of the condensin II complex subunits in 24 species across the tree of life with a multistep BLAST approach. The data found here are the BLAST alignments for these searches. The first searches were conducted in October/November 2019 and were manually double-checked in February and March 2020. Searches for other organisms were conducted in June 2020. All alignments were posted in: Our approach was based on a search strategy used in earlier work by King et al. (https://doi.org/10.1093/molbev/msz140). We started by collecting publicly available protein sequences of the condensin I and II complex subunits of four diverse species from UniProt: Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. As a positive control we searched for SMC2 and SMC4, and the condensin I subunits, which are thought to be essential in all species. In the first alignment step, we used tblastn to search with the protein sequences of the above species against the nucleotide collection (nr/nt) database of the target species. The Expect threshold was set at 0.05. We reported an alignment as a hit when it had an E-value of 1E-10 or less with multiple regions of alignment. If an alignment had lower confidence, we did an extra validation step to confirm it. This step entailed downloading the translated nucleotide sequence of the putative alignment and using tblastn to search against the genome of a closely related organism with an annotated genome. If this search yielded the putative protein we had used as bait, we considered the hit validated. In the second alignment step we used the same approach, but we searched against the wgs database of the target species. We again used 1E-10 as the E-value cut-off. In the third step, only a few organisms still had missing subunits. 
To make an extra effort to find these subunits, we used the corresponding subunits of the nearest neighbour, which we identified in step 1 or 2, as bait. As the identified subunits were all nucleotide sequences, we used tblastx to translate these query sequences to protein sequences and search against a translated nucleotide database. In this step we searched both the nr/nt database and the wgs database. As we were able to identify all SMC2/4 subunits but still missed condensin II subunits, we are now fairly sure that these organisms indeed lack these condensin II subunits. However, it is still possible that these organisms do have all condensin II subunits, but with very low sequence conservation. We were also able to identify the condensin I subunits in almost all species, with two notable exceptions (see Table S4). The Arctic lamprey lacked condensin I subunits CAPG and CAPD2. Because we were able to identify all condensin II subunits in this organism, we still included this species in our analysis. The other exception is the tardigrade. In this species we identified SMC2 and SMC4, but could not identify any of the accessory subunits of condensin I or II. There are multiple possible explanations for this. On the one hand, it might have a biological explanation: for example, in this organism condensin's accessory subunits may have evolved beyond recognition with our methods, or this species may indeed have lost both condensin I and II. On the other hand, the missing subunits may be explained by a technical issue, e.g. the quality of the databases. Therefore we cannot conclude with full certainty that condensin II is indeed missing in the tardigrade, and this will need to be investigated further.
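The hit rule described above (E-value of 1E-10 or less with multiple regions of alignment, otherwise an extra validation step) can be sketched as follows. This is a minimal illustration and not the authors' code: the `Hsp` record, the `classify_subjects` function, and the reading of "multiple regions" as at least two aligned segments per subject sequence are all assumptions made for this example.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-10  # hit threshold stated in the text
MIN_REGIONS = 2         # assumption: "multiple regions" means at least two


@dataclass
class Hsp:
    """One aligned region (hypothetical record, e.g. parsed from tabular BLAST output)."""
    subject_id: str
    evalue: float


def classify_subjects(hsps, cutoff=E_VALUE_CUTOFF, min_regions=MIN_REGIONS):
    """Label each subject sequence as 'hit' or 'needs_validation'."""
    by_subject = {}
    for hsp in hsps:
        by_subject.setdefault(hsp.subject_id, []).append(hsp)

    labels = {}
    for subject, group in by_subject.items():
        best_evalue = min(h.evalue for h in group)
        if best_evalue <= cutoff and len(group) >= min_regions:
            labels[subject] = "hit"
        else:
            # Lower-confidence alignments go to the reciprocal search
            # against a closely related, annotated genome.
            labels[subject] = "needs_validation"
    return labels


example = [Hsp("scaffold_1", 1e-30), Hsp("scaffold_1", 1e-12), Hsp("scaffold_2", 1e-4)]
print(classify_subjects(example))
```

The sketch only sorts alignments into the two categories; the validation search itself is a separate BLAST run and is not modelled here.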
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Understanding genome organization requires integration of DNA sequence and 3D spatial context; however, existing genome-wide methods lack either base-pair sequence resolution or direct spatial localization. Here, we describe in situ genome sequencing (IGS), a method for simultaneously sequencing and imaging genomes within intact biological samples. We applied IGS to human fibroblasts and early mouse embryos, spatially localizing thousands of genomic loci in individual nuclei. Using these data, we characterized parent-specific changes in genome structure across embryonic stages, revealed single-cell chromatin domains in zygotes, and uncovered epigenetic memory of global chromosome positioning within individual embryos. These results demonstrate how in situ genome sequencing can directly connect sequence and structure across length scales from single base pairs to whole organisms.
A large database of CATH protein domain assignments for Ensembl genomes and UniProt sequences. Gene3D is a resource for studying proteins and their component domains. Gene3D takes CATH domains from Protein Data Bank (PDB) structures and assigns them to the millions of protein sequences with no PDB structures using hidden Markov models. Assigning a CATH superfamily to a region of a protein sequence gives information on the gross 3D structure of that region of the protein. CATH superfamilies have a limited set of functions, so the domain assignment provides some functional insights. Furthermore, most proteins have several different domains in a specific order, so looking for proteins with a similar domain organization provides further functional insights. Strict confidence cut-offs are used to ensure the reliability of the domain assignments. Gene3D imports functional information from sources such as UniProt and KEGG. It also imports experimental datasets on request to help researchers integrate their data with the corpus of the literature. The website allows users to view descriptions of single proteins and genes as well as large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation, or associated functions and interactions can be used to expand explorations to new proteins. The Gene3D web services provide programmatic access to the CATH-Gene3D annotation resources and in-house software tools. These services include Gene3DScan for identifying structural domains within protein sequences, access to pre-calculated annotations for the major sequence databases, and linked functional annotation from UniProt, GO and KEGG.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset (Part 2) provides the additional chromatin track files required for using the chromatin track plotting functions of Orca. Orca is a sequence-based deep learning modeling framework for multiscale genome 3D architecture.
The MIPS Comprehensive Yeast Genome Database (CYGD) aims to present information on the molecular structure and functional network of the entirely sequenced, well-studied model eukaryote, the budding yeast Saccharomyces cerevisiae. In addition, the data of various projects on related yeasts are used for comparative analysis.
GTOP is a database consisting of analyses of proteins identified by various genome projects. The database mainly uses sequence homology analyses and makes extensive use of information on three-dimensional structures. GTOP is built by the Laboratory of Gene-Product Informatics at the National Institute of Genetics. This research is supported by the Japan Science and Technology Corporation and Grants-in-Aid for Scientific Research (Genomes in category C) from the Ministry of Education, Science, Sports and Culture of Japan. We use the following methods. Prediction of 3D structure: sequence homology search of PDB, using reverse PSI-BLAST. Functional predictions (family classifications): sequence homology search of Swiss-Prot, a well-annotated sequence database, using BLAST. We are also carrying out the following analyses: motif analysis (PROSITE); family classification (Pfam); prediction of transmembrane helix domains (SOSUI); prediction of coiled-coil regions (Multicoil); repetitive sequence analysis (RepAlign).
Description of linked resources for this dataset; all links can be found in the related dataset section.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset (Part 1) provides the core resource files required for using the code of Orca, including models and the hg38 reference genome (resources_core.tar.gz), and the micro-C mcool files required for extracting the experimental observations (resources_mcools.tar.gz). Orca is a sequence-based deep learning modeling framework for multiscale genome 3D architecture.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 16, 2013. Database and customized tools to study the PFAM protein domain content of the transcriptome for all expressed genes of Homo sapiens, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans tethered to both a genomics array repository database and a range of external information resources. GeneSpeed has merged information from several existing data sets including the Gene Ontology Consortium, InterPro, Pfam, Unigene, as well as micro-array datasets. GeneSpeed is a database of PFAM domain homology contained within Unigene. Because Unigene is a non-redundant dbEST database, this provides a wide encompassing overview of the domain content of the expressed transcriptome. We have structured the GeneSpeed Database to include a rich toolset allowing the investigator to study all domain homology, no matter how remote. As a result, homology cutoff score decisions are determined by the scientist, not by a computer algorithm. This quality is one of the novel defining features of the GeneSpeed database giving the user complete control of database content. In addition to a domain content toolset, GeneSpeed provides an assortment of links to external databases, a unique and manually curated Transcription Factor Classification list, as well as links to our newly evolving GeneSpeed BetaCell Database. GeneSpeed BetaCell is a micro-array depository combined with custom array analysis tools created with an emphasis around the meta analysis of developmental time series micro-array datasets and their significance in pancreatic beta cells.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 9. P values of the present RNA-seq analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 7. RPKM and fold change of all Parachlorella transcripts.
https://ega-archive.org/dacs/EGAC00001001309
We used novel processing techniques to obtain whole-genome data together with 3D anatomic and histomorphologic analysis in two men (GP5 and GP12) with high-risk prostate cancer (PrCa) undergoing radical prostatectomy. A total of 22 whole-genome-sequenced sites (16 primary cancer foci and 6 lymph node metastases) were analyzed using evolutionary reconstruction tools and spatio-evolutionary models. Probability models were used to trace spatial and chronological origins of the primary tumor and metastases, chart their genetic drivers, and distinguish metastatic and non-metastatic subclones.
Database of a set of standard 3D virtual models of human development at Carnegie Stages (CS) 12-23 (approximately 26-56 days post conception), in which anatomical regions have been defined using a set of anatomical terms for each stage of development (known as an ontology). Experimental data are captured, converted to digital format and then mapped to the appropriate 3D model. The ontology is used to define sites of gene expression using a set of standard descriptions and to link the expression data to an "anatomical tree". Human data from stages CS12 to CS23 can be submitted to the HUDSEN Gene Expression Database. The anatomy ontology currently being used is based on the Edinburgh Human Developmental Anatomy Database, which encompasses all developing structures from CS1 to CS20 but is not detailed for developing brain structures. The ontology is being extended and refined (by Prof Luis Puelles, University of Murcia, Spain) and will be incorporated into the HUDSEN database as it is developed. Expression data are annotated using two methods to denote sites of expression in the embryo: spatial annotation and text annotation. Additionally, many aspects of the detection reagent and specimen are annotated during this process (assignment of IDs, nucleotide sequences for probes, etc.). There are currently two main ways to search HUDSEN: using a gene/protein name or a named anatomical structure as the query term. The entire contents of the database can be browsed using the data browser. Results may be saved. The data in HUDSEN are generated both by researchers within the HUDSEN project and by the wider scientific community. 
The HUDSEN human gene expression spatial database is a collaboration between the Institute of Human Genetics in Newcastle, UK, and the MRC Human Genetics Unit in Edinburgh, UK, and was developed as part of the Electronic Atlas of the Developing Human Brain (EADHB) project (funded by the NIH Human Brain Project). The database is based on the Edinburgh Mouse Atlas gene expression database (EMAGE), and is designed to be an openly available resource to the research community holding gene expression patterns during early human development.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts in health and disease. However, effectively capturing the extensive long-range dependencies between DNA sequences, spanning millions of base pairs as seen in tasks such as three-dimensional (3D) chromatin folding, remains a significant challenge. Additionally, a comprehensive benchmark suite for evaluating tasks reliant on long-range dependencies is notably absent. To address this gap, we introduce DNALONGBENCH, a benchmark dataset spanning five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signal. To comprehensively assess DNALONGBENCH, we evaluate the performance of three baseline methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and a fine-tuned DNA foundation model, HyenaDNA. We envision DNALONGBENCH becoming a standardized resource that facilitates comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that consider long-range dependencies.
https://spdx.org/licenses/CC0-1.0.html
The yak is an important livestock animal for the people who live on the harsh, oxygen-deprived Qinghai-Tibetan Plateau and in the Hindu Kush Himalayan mountains. Although a yak genome was sequenced in 2012, the assembly is quite fragmented owing to the limitations of Illumina sequencing technology. An accurate and complete reference genome is critical for studying the genetic variation of a species. Long-read sequences are more complete than short-read ones, and they have been successfully used for high-quality genome assembly in several species. Here, we present a high-quality assembly of the yak genome (PB_v1.0) at chromosome scale, which was constructed using long-read sequencing technology assisted by chromatin interaction technology. Compared to the previous yak genome assembly (BosGru_v2.0), the PB_v1.0 assembly has substantially improved chromosome sequence continuity, reduced ambiguity in repetitive structures, and more complete gene models. To characterize the genetic variation of yak in depth, we generated de novo genome assemblies based on Illumina short reads of seven recognized domestic yak breeds from Tibet and Sichuan as well as one wild yak from Hoh Xil. By comparing these eight assemblies to the PB_v1.0 genome, we obtained a comprehensive map of yak genetic diversity at the whole-genome level and identified a few protein-coding genes that were absent from the PB_v1.0 assembly. Although the wild yak has undergone a population bottleneck, its genetic diversity is still higher than that of domestic yak. By whole-genome alignment, we identified breed-specific sequences and genes, which will help in identifying yak breeds.
Methods
High-quality DNA was extracted from the peripheral blood of a female yak in Riwoqe County, Tibet. SMRT sequencing libraries were constructed with a Blood&Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany). A total of 142 SMRT cells generated 184.6 Gbp of subread bases with a mean read length of 9.5 kbp on a PacBio RS II instrument (Pacific Biosciences, Menlo Park, CA, USA). The Falcon (v. 0.5.0) pipeline was used for the initial assembly. The first step was to identify all overlaps in the raw reads and then correct read errors by leveraging the overlap information. The second step was to detect overlaps in the corrected reads; this step required no consensus calling. The final step was to generate the string graph assembly and the contig sequence output in FASTA format. To improve the quality of the initial assembly, 113.34 Gbp of Illumina short reads were generated from the same individual. Using Pilon(v1.23)8, 845,002 homozygous insertions, 166,908 deletions, and 2,355,196 substitutions were identified and corrected. DNA from the same individual used in the PacBio sequencing was extracted and processed according to BioNano Genomics guidelines. The raw data were assembled with the BioNano Solve (v. 3.1.00) assembly pipeline (BioNano Genomics, San Diego, CA, USA). The combination of this assembly with the initial one yielded a superior assembly with a scaffold N50 of 65.67 Mbp and a maximum scaffold length of 128.62 Mbp. Hi-C libraries were created from yak whole-blood cells: 2–5 million cells were cross-linked and digested with the restriction enzyme HindIII. The sticky ends of all fragments were biotinylated, ligated to each other to form chimeric circles, enriched, sheared, and processed into sequencing libraries in which the individual templates were chimeras of the physically associated DNA molecules from the original cross-linking. Hi-C reads were generated on an Illumina sequencing platform. 
The paired-end reads were uniquely mapped onto the BioNano assembly, and the scaffolds were classified into 30 groups using 3d-DNA (20180922); the result was taken as the final assembly and is referred to as PB_v1.0. The exact location of each scaffold within the 30 groups was based on the collinearity between yak and cattle (UMD3.1.1).
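The scaffold N50 quoted in the assembly description is the length of the scaffold at which the cumulative length, counted from the longest scaffold downwards, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the yak assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one non-zero length")


# Toy example with made-up lengths (not real yak scaffolds).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA file, for example by summing the sequence length under each header.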
Seven domestic yak breeds and one wild yak were selected for whole-genome sequencing and assembly. DNA was extracted from the ears of the Tibetan breeds, the blood of the Sichuan breeds, and the skin of the wild yak from Kunlun Spring, Hoh Xil. A whole-genome shotgun strategy and next-generation sequencing (NGS) technologies were run on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). Each genome was sequenced with a combination of short-insert (180 bp and 500 bp) and long-insert (2 kbp and 5 kbp) DNA libraries. SOAPdenovo (v2.04) was used to assemble each genome.