100+ datasets found
  1. 3D-Genomics Database

    • dknet.org
    • scicrunch.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C.briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E.cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G.theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N.crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P.falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.

  2. Data from: Repeat elements organise 3D genome structure and mediate...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    • +1more
    Updated Oct 24, 2018
    Cite
    Berry, Daniel; Winter, David J.; Dupont, Pierre-Yves; Young, Carolyn A.; Ganley, Austen R. D.; Cox, Murray P.; Scott, Barry; Ram, Arvina; Schardl, Christopher L.; Liachko, Ivan (2018). Repeat elements organise 3D genome structure and mediate transcription in the filamentous fungus Epichloë festucae [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000680242
    Dataset updated
    Oct 24, 2018
    Authors
    Berry, Daniel; Winter, David J.; Dupont, Pierre-Yves; Young, Carolyn A.; Ganley, Austen R. D.; Cox, Murray P.; Scott, Barry; Ram, Arvina; Schardl, Christopher L.; Liachko, Ivan
    Description

    Structural features of genomes, including the three-dimensional arrangement of DNA in the nucleus, are increasingly seen as key contributors to the regulation of gene expression. However, studies on how genome structure and nuclear organisation influence transcription have so far been limited to a handful of model species. This narrow focus limits our ability to draw general conclusions about the ways in which three-dimensional structures are encoded, and to integrate information from three-dimensional data to address a broader gamut of biological questions. Here, we generate a complete and gapless genome sequence for the filamentous fungus, Epichloë festucae. We use Hi-C data to examine the three-dimensional organisation of the genome, and RNA-seq data to investigate how Epichloë genome structure contributes to the suite of transcriptional changes needed to maintain symbiotic relationships with the grass host. Our results reveal a genome in which very repeat-rich blocks of DNA with discrete boundaries are interspersed by gene-rich sequences that are almost repeat-free. In contrast to other species reported to date, the three-dimensional structure of the genome is anchored by these repeat blocks, which act to isolate transcription in neighbouring gene-rich regions. Genes that are differentially expressed in planta are enriched near the boundaries of these repeat-rich blocks, suggesting that their three-dimensional orientation partly encodes and regulates the symbiotic relationship formed by this organism.

  3. Machine learning reveals the diversity of human 3D chromatin contact...

    • zenodo.org
    bin, csv, txt, zip
    Updated Oct 11, 2024
    Cite
    Erin Gilbertson (2024). Machine learning reveals the diversity of human 3D chromatin contact patterns (example predictions genome wide) [Dataset]. http://doi.org/10.5281/zenodo.13900918
    Available download formats: zip, txt, bin, csv
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erin Gilbertson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data for the paper: Machine learning reveals the diversity of human 3D chromatin contact patterns

    GitHub: https://github.com/erin-n-gilbertson/3DGenome-diversity/tree/main

    bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.22.573104v1.full

    Manuscript accepted at Molecular Biology and Evolution

    Of primary interest will be the example predictions genome wide for hg38 reference, human-archaic hominin ancestor and most divergent 1KG individual per genome along with the Jupyter notebook tutorial for making your own Akita predictions given any input 1MB sequence.

    • bin: contains a Python script and a qsub array shell script for generating example predictions. These scripts can be modified to take in any fasta files as input.
    • akita_predictions: contains both Akita prediction output arrays and SVG files with predicted contact maps for the hg38 reference, human-archaic hominin ancestor and most divergent 1KG individual in each of 4,873 1MB windows
    • anc_window_spearman.csv: Spearman correlation between each 1KG individual and the ancestor for each 1MB window. To calculate 3D divergence, subtract these values from 1.
    • basenji: basenji dir from their github, necessary in the directory to run predictions - https://github.com/calico/basenji/tree/master
    • genomes: fasta genomes for hg38 reference and human-archaic hominin ancestor used to make akita predictions
    • divergent_windows: variants and expected divergence distributions for 392 windows that are more divergent than expected. These are defined in the manuscript as windows where 3D divergence between 1KG individuals and the ancestor is greater than what would be expected based on sequence divergence. See manuscript Fig. S9 for more details.
    • windows.txt: 4,873 1MB genomic windows with 100% coverage in hg38 used for Akita predictions
    • making_examples.ipynb: jupyter notebook with tutorial instructions for making Akita predictions on any human genome sequence.
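
The file list above states that 3D divergence is obtained by subtracting the Spearman correlation from 1. A minimal sketch of that arithmetic follows, assuming two flattened contact maps of equal length as input. The function names are hypothetical, the Spearman implementation is a plain standard-library one rather than the authors' code, and the column layout of anc_window_spearman.csv is not assumed here.

```python
from statistics import mean


def rank(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def divergence_3d(rho):
    """3D divergence as described in the listing: 1 minus the Spearman correlation."""
    return 1.0 - rho


# Toy vectors standing in for two flattened contact maps (made-up numbers).
individual_map = [0.1, 0.4, 0.2, 0.9]
ancestor_map = [0.2, 0.5, 0.1, 0.8]
window_divergence = divergence_3d(spearman(individual_map, ancestor_map))
```

Under this definition, identical rankings give a divergence of 0 and fully reversed rankings give 2.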


  4. Data from: 3D genomics across the tree of life identifies condensin II as a...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 20, 2023
    Cite
    Hoencamp, Claire (2023). Data from: 3D genomics across the tree of life identifies condensin II as a determinant of architecture type [Dataset]. http://doi.org/10.7910/DVN/UROKAG
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hoencamp, Claire
    Description

    We analyzed conservation of condensin II complex subunits in 24 species across the tree of life with a multistep BLAST approach. The data found here are the BLAST alignments for these searches. The first searches were conducted in October/November 2019 and were manually double-checked in February and March 2020. Searches for other organisms were conducted in June 2020. All alignments were posted in: Our approach was based on a search strategy as used in earlier work by King et al. (https://doi.org/10.1093/molbev/msz140). We started by collecting publicly available protein sequences of the condensin I and II complex subunits of four diverse species from Uniprot: Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. As a positive control we searched for SMC2 and SMC4, and the condensin I subunits, which are thought to be essential in all species.

    In the first alignment step, we used tblastn to search with the translated protein sequences of the above species against the nucleotide collection (nr/nt) database of the target species. The Expect threshold was set at 0.05. We reported an alignment as a hit when it had an E-value of 1E-10 or less with multiple regions of alignment. If there was an alignment with less confidence, we did an extra validation step to confirm the alignment. This step entailed downloading the translated nucleotide sequence of the putative alignment and using tblastn to search against the genome of a closely related organism with an annotated genome. If this search yielded the putative protein we used as a bait, we considered the hit validated.

    In the second alignment step we used the same approach, but we blasted against the wgs database of the target species. We again used 1E-10 as E-value cut-off.

    In the third step, only a few organisms still had missing subunits. To make an extra effort to find these subunits, we used the corresponding subunits of the nearest neighbour, which we identified in step 1 or 2, as bait. As the identified subunits were all nucleotide sequences, we used tblastx to translate these query sequences to protein sequences and blast against a translated nucleotide database. In this step we searched both the nr/nt database and the wgs database.

    As we were able to identify all SMC2/4 subunits, but still missed condensin II subunits, we are now fairly sure these organisms indeed lack these condensin II subunits. However, it is still possible that these organisms do have all condensin II subunits, but with very low sequence conservation. We were also able to identify the condensin I subunits in almost all species, with two notable exceptions (see Table S4). The Arctic lamprey lacked condensin I subunits CAPG and CAPD2. Because we were able to identify all condensin II subunits in this organism, we still included this species in our analysis. The other exception is the tardigrade. In this species we identified SMC2 and SMC4, but could not identify any of the accessory subunits of condensin I or II. There are multiple possible explanations for this. On the one hand, it might have a biological explanation: for example, in this organism condensin's accessory subunits may have evolved beyond recognition with our methods, or this species may indeed have lost both condensin I and II. On the other hand, the missing subunits may be explained by a technical issue, e.g. the quality of the databases. Therefore we cannot conclude with full certainty that condensin II is indeed missing in the tardigrade, and this will need to be investigated further.
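
The hit rule described above (an E-value of 1E-10 or less together with multiple regions of alignment, otherwise an extra validation step) can be sketched as a small filter. This is an illustration only, not the authors' pipeline: the `Alignment` record and its field names are hypothetical and do not correspond to any BLAST output format, and treating a single-region alignment as needing validation is an assumption about what "less confidence" covers.

```python
from dataclasses import dataclass

# Cut-off quoted in the description: an E-value of 1E-10 or less.
E_VALUE_CUTOFF = 1e-10


@dataclass
class Alignment:
    """Hypothetical summary of one BLAST alignment (not a real BLAST record)."""
    subject: str
    evalue: float
    aligned_regions: int  # number of separate regions of alignment


def classify(alignment: Alignment) -> str:
    """Label an alignment as a direct hit or as a candidate for extra validation."""
    if alignment.evalue <= E_VALUE_CUTOFF and alignment.aligned_regions > 1:
        return "hit"
    # Assumption: anything below that bar goes to the validation step.
    return "needs validation"


# Made-up example alignments.
examples = [
    Alignment("scaffold_12", 3e-45, 4),
    Alignment("scaffold_7", 2e-6, 2),
    Alignment("scaffold_3", 1e-20, 1),
]
labels = {a.subject: classify(a) for a in examples}
```

The Expect threshold of 0.05 mentioned in the description is a search-time setting and is deliberately left out of this post-hoc filter.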

  5. Data from: In situ genome sequencing resolves DNA sequence and structure in...

    • idr.openmicroscopy.org
    Updated Dec 31, 2020
    Cite
    (2020). In situ genome sequencing resolves DNA sequence and structure in intact biological samples [Dataset]. https://idr.openmicroscopy.org/study/idr0101/
    Dataset updated
    Dec 31, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Understanding genome organization requires integration of DNA sequence and 3D spatial context; however, existing genome-wide methods lack either base-pair sequence resolution or direct spatial localization. Here, we describe in situ genome sequencing (IGS), a method for simultaneously sequencing and imaging genomes within intact biological samples. We applied IGS to human fibroblasts and early mouse embryos, spatially localizing thousands of genomic loci in individual nuclei. Using these data, we characterized parent-specific changes in genome structure across embryonic stages, revealed single-cell chromatin domains in zygotes, and uncovered epigenetic memory of global chromosome positioning within individual embryos. These results demonstrate how in situ genome sequencing can directly connect sequence and structure across length scales from single base pairs to whole organisms.

  6. Gene3D

    • rrid.site
    • scicrunch.org
    Updated Aug 23, 2025
    Cite
    (2025). Gene3D [Dataset]. http://identifiers.org/RRID:SCR_007672
    Dataset updated
    Aug 23, 2025
    Description

    A large database of CATH protein domain assignments for ENSEMBL genomes and UniProt sequences. Gene3D is a resource for studying proteins and their component domains. Gene3D takes CATH domains from Protein Data Bank (PDB) structures and assigns them to the millions of protein sequences with no PDB structures using hidden Markov models. Assigning a CATH superfamily to a region of a protein sequence gives information on the gross 3D structure of that region of the protein. CATH superfamilies have a limited set of functions, and so the domain assignment provides some functional insights. Furthermore, most proteins have several different domains in a specific order, so looking for proteins with a similar domain organization provides further functional insights. Strict confidence cut-offs are used to ensure the reliability of the domain assignments. Gene3D imports functional information from sources such as UniProt and KEGG. They also import experimental datasets on request to help researchers integrate their data with the corpus of the literature. The website allows users to view descriptions for both single proteins and genes and large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation, or associated functions and interactions can be used to expand explorations to new proteins. The Gene3D web services provide programmatic access to the CATH-Gene3D annotation resources and in-house software tools. These services include Gene3DScan for identifying structural domains within protein sequences, access to pre-calculated annotations for the major sequence databases, and linked functional annotation from UniProt, GO and KEGG.

  7. Orca: Sequence-based modeling of genome 3D architecture from kilobase to...

    • zenodo.org
    application/gzip
    Updated Mar 20, 2021
    Cite
    Jian Zhou (2021). Orca: Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale (Part2) [Dataset]. http://doi.org/10.5281/zenodo.4594676
    Available download formats: application/gzip
    Dataset updated
    Mar 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jian Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset (Part 2) provides additional chromatin tracks files required for using the chromatin track plotting functions of Orca. Orca is a sequence-based deep learning modeling framework for multiscale genome 3D architecture.

  8. CYGD - Comprehensive Yeast Genome Database

    • dknet.org
    • test2.scicrunch.org
    • +1more
    Updated Aug 18, 2024
    Cite
    (2024). CYGD - Comprehensive Yeast Genome Database [Dataset]. http://identifiers.org/RRID:SCR_002289
    Dataset updated
    Aug 18, 2024
    Description

    The MIPS Comprehensive Yeast Genome Database (CYGD) aims to present information on the molecular structure and functional network of the entirely sequenced, well-studied model eukaryote, the budding yeast Saccharomyces cerevisiae. In addition, the data of various projects on related yeasts are used for comparative analysis.

  9. GTOP - Genomes To Protein structures

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). GTOP - Genomes To Protein structures [Dataset]. http://identifiers.org/RRID:SCR_007698
    Dataset updated
    Jan 29, 2022
    Description

    GTOP is a database consisting of data analyses of proteins identified by various genome projects. This database mainly uses sequence homology analyses and features extensive utilization of information on three-dimensional structures. GTOP is built by the Laboratory of Gene-Product Informatics at the National Institute of Genetics. This research is supported by the Japan Science and Technology Corporation and Grants-in-Aid for Scientific Research (Genomes in category C) from the Ministry of Education, Science, Sports and Culture of Japan. We use the following methods. Prediction of 3D structure: sequence homology search of PDB, using REVERSE PSI-BLAST. Functional predictions (family classifications): sequence homology search of Swiss-Prot, a well-annotated sequence database, with the use of BLAST. Other analytical methods: we are also carrying out motif analysis (PROSITE), family classification (Pfam), prediction of transmembrane helix domains (SOSUI), prediction of coiled-coil regions (Multicoil), and repetitive sequence analysis (RepAlign).

  10. Full genome and transcriptome sequence assembly of the non-model organism...

    • search.dataone.org
    • bco-dmo.org
    • +1more
    Updated Dec 29, 2024
    Cite
    Crow White; Robert J. Toonen; Mark Christie; Jean Davidson; Paul Anderson; Benjamin Daniels; Andy Lee; Cataixa López (2024). Full genome and transcriptome sequence assembly of the non-model organism Kellet’s whelk, Kelletia kelletii [Dataset]. http://doi.org/10.26008/1912/bco-dmo.945292.1
    Dataset updated
    Dec 29, 2024
    Dataset provided by
    Biological and Chemical Oceanography Data Management Office (BCO-DMO)
    Authors
    Crow White; Robert J. Toonen; Mark Christie; Jean Davidson; Paul Anderson; Benjamin Daniels; Andy Lee; Cataixa López
    Time period covered
    Aug 28, 2019 - Jul 8, 2020
    Description

    Description of linked resources for this dataset, all links can be found in the related dataset section.

  11. Orca: Sequence-based modeling of genome 3D architecture from kilobase to...

    • zenodo.org
    application/gzip
    Updated Feb 24, 2022
    Cite
    Jian Zhou (2022). Orca: Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale (Part1) [Dataset]. http://doi.org/10.5281/zenodo.4594207
    Available download formats: application/gzip
    Dataset updated
    Feb 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jian Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset (Part 1) provides the core resource files required for using the code of Orca, including models and the hg38 reference genome (resources_core.tar.gz), and the micro-C mcool files required for extracting the experimental observations (resources_mcools.tar.gz). Orca is a sequence-based deep learning modeling framework for multiscale genome 3D architecture.

  12. CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    Dataset updated
    Oct 21, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.

  13. GeneSpeed- A Database of Unigene Domain Organization

    • rrid.site
    • test2.scicrunch.org
    Updated Aug 26, 2025
    Cite
    (2025). GeneSpeed- A Database of Unigene Domain Organization [Dataset]. http://identifiers.org/RRID:SCR_002779
    Dataset updated
    Aug 26, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 16, 2013. Database and customized tools to study the PFAM protein domain content of the transcriptome for all expressed genes of Homo sapiens, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans tethered to both a genomics array repository database and a range of external information resources. GeneSpeed has merged information from several existing data sets including the Gene Ontology Consortium, InterPro, Pfam, Unigene, as well as micro-array datasets. GeneSpeed is a database of PFAM domain homology contained within Unigene. Because Unigene is a non-redundant dbEST database, this provides a wide encompassing overview of the domain content of the expressed transcriptome. We have structured the GeneSpeed Database to include a rich toolset allowing the investigator to study all domain homology, no matter how remote. As a result, homology cutoff score decisions are determined by the scientist, not by a computer algorithm. This quality is one of the novel defining features of the GeneSpeed database giving the user complete control of database content. In addition to a domain content toolset, GeneSpeed provides an assortment of links to external databases, a unique and manually curated Transcription Factor Classification list, as well as links to our newly evolving GeneSpeed BetaCell Database. GeneSpeed BetaCell is a micro-array depository combined with custom array analysis tools created with an emphasis around the meta analysis of developmental time series micro-array datasets and their significance in pancreatic beta cells.

  14. MOESM5 of Highly efficient lipid production in the green alga Parachlorella...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    Cite
    Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano (2023). MOESM5 of Highly efficient lipid production in the green alga Parachlorella kessleri: draft genome and transcriptome endorsed by whole-cell 3D ultrastructure [Dataset]. http://doi.org/10.6084/m9.figshare.10038743.v1
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 9. P values of the present RNA-seq analysis.

  15. MOESM3 of Highly efficient lipid production in the green alga Parachlorella...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    Cite
    Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano (2023). MOESM3 of Highly efficient lipid production in the green alga Parachlorella kessleri: draft genome and transcriptome endorsed by whole-cell 3D ultrastructure [Dataset]. http://doi.org/10.6084/m9.figshare.10038737.v1
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 7. RPKM and fold change of all Parachlorella transcripts.

  16. Nurminen et al ("GP2Men") Study Primary and Metastatic Prostate Cancer Whole...

    • ega-archive.org
    Updated Oct 4, 2023
    Cite
    (2023). Nurminen et al ("GP2Men") Study Primary and Metastatic Prostate Cancer Whole Genome Sequence Data [Dataset]. https://ega-archive.org/datasets/EGAD50000000005
    Dataset updated
    Oct 4, 2023
    License

    https://ega-archive.org/dacs/EGAC00001001309

    Description

    We used novel processing techniques to obtain whole genome data together with 3D anatomic and histomorphologic analysis in two men (GP5 and GP12) with high risk PrCa undergoing radical prostatectomy. A total of 22 whole genome-sequenced sites (16 primary cancer foci and 6 lymph node metastatic) were analyzed using evolutionary reconstruction tools and spatio-evolutionary models. Probability models were used to trace spatial and chronological origins of the primary tumor and metastases, chart their genetic drivers, and distinguish metastatic and non-metastatic subclones.

  17. HUDSEN Human Gene Expression Spatial Database

    • rrid.site
    • dknet.org
    Updated Jul 14, 2025
    Cite
    (2025). HUDSEN Human Gene Expression Spatial Database [Dataset]. http://identifiers.org/RRID:SCR_006325/resolver?q=*&i=rrid
    Dataset updated
    Jul 14, 2025
    Description

    Database of a set of standard 3D virtual models at different stages of development from Carnegie Stages (CS) 12-23 (approximately 26-56 days post conception) in which various anatomical regions have been defined with a set of anatomical terms at various stages of development (known as an ontology). Experimental data is captured and converted to digital format and then mapped to the appropriate 3D model. The ontology is used to define sites of gene expression using a set of standard descriptions and to link the expression data to an "anatomical tree". Human data from stages CS12 to CS23 can be submitted to the HUDSEN Gene Expression Database. The anatomy ontology currently being used is based on the Edinburgh Human Developmental Anatomy Database, which encompasses all developing structures from CS1 to CS20 but is not detailed for developing brain structures. The ontology is being extended and refined (by Prof Luis Puelles, University of Murcia, Spain) and will be incorporated into the HUDSEN database as it is developed. Expression data is annotated using two methods to denote sites of expression in the embryo: spatial annotation and text annotation. Additionally, many aspects of the detection reagent and specimen are also annotated during this process (assignment of IDs, nucleotide sequences for probes etc). There are currently two main ways to search HUDSEN: using a gene/protein name or a named anatomical structure as the query term. The entire contents of the database can be browsed using the data browser. Results may be saved. The data in HUDSEN is generated both by researchers within the HUDSEN project and by the wider scientific community. The HUDSEN human gene expression spatial database is a collaboration between the Institute of Human Genetics in Newcastle, UK, and the MRC Human Genetics Unit in Edinburgh, UK, and was developed as part of the Electronic Atlas of the Developing Human Brain (EADHB) project (funded by the NIH Human Brain Project). The database is based on the Edinburgh Mouse Atlas gene expression database (EMAGE), and is designed to be an openly available resource to the research community holding gene expression patterns during early human development.

  18. GTOP - Genomes To Protein structures

    • dknet.org
    Updated Jan 29, 2022
    Cite
    (2022). GTOP - Genomes To Protein structures [Dataset]. http://identifiers.org/RRID:SCR_007698
    Dataset updated
    Jan 29, 2022
    Description

    GTOP is a database consisting of data analyses of proteins identified by various genome projects. This database mainly uses sequence homology analyses and features extensive utilization of information on three-dimensional structures. GTOP is built by the Laboratory of Gene-Product Informatics at the National Institute of Genetics. This research is supported by the Japan Science and Technology Corporation and Grants-in-Aid for Scientific Research (Genomes in category C) from the Ministry of Education, Science, Sports and Culture of Japan. We use the following methods. Prediction of 3D structure: sequence homology search of PDB, using REVERSE PSI-BLAST. Functional predictions (family classifications): sequence homology search of Swiss-Prot, a well-annotated sequence database, with the use of BLAST. Other analytical methods: we are also carrying out motif analysis (PROSITE), family classification (Pfam), prediction of transmembrane helix domains (SOSUI), prediction of coiled-coil regions (Multicoil), and repetitive sequence analysis (RepAlign).

  19. H

    DNALongBench eQTL data

    • dataverse.harvard.edu
    Updated Aug 10, 2025
    Cite
    DNALongBench_Author (2025). DNALongBench eQTL data [Dataset]. http://doi.org/10.7910/DVN/YUP2G5
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 10, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    DNALongBench_Author
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts in health and disease. However, effectively capturing the extensive long-range dependencies between DNA sequences, spanning millions of base pairs as seen in tasks such as three-dimensional (3D) chromatin folding, remains a significant challenge. Additionally, a comprehensive benchmark suite for evaluating tasks reliant on long-range dependencies is notably absent. To address this gap, we introduce DNALONGBENCH, a benchmark dataset spanning five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signal. To comprehensively assess DNALONGBENCH, we evaluate the performance of three baseline methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and a fine-tuned DNA foundation model, HyenaDNA. We envision DNALONGBENCH becoming a standardized resource that facilitates comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that consider long-range dependencies.

  20. n

    Data from: A chromosome-scale reference genome and genome-wide genetic...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Oct 29, 2020
    Cite
    Jin-cheng Zhong; Qiu-mei Ji; Jin-wei Xin; Zhi-xin Chai; Cheng-fu Zhang; Yangla Dawa; Sang Luo; Qiang Zhang; Zhandui Pingcuo; Min-sheng Peng; Yong Zhu; Han-wen Cao; Hui Wang; Jian-lin Han (2020). A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak [Dataset]. http://doi.org/10.5061/dryad.jh9w0vt7x
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2020
    Dataset provided by
    Kunming Institute of Zoology
    Chinese Academy of Agricultural Sciences
    Southwest Minzu University
    State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, Lhasa, P.R. China
    Authors
    Jin-cheng Zhong; Qiu-mei Ji; Jin-wei Xin; Zhi-xin Chai; Cheng-fu Zhang; Yangla Dawa; Sang Luo; Qiang Zhang; Zhandui Pingcuo; Min-sheng Peng; Yong Zhu; Han-wen Cao; Hui Wang; Jian-lin Han
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Yak is an important livestock species for people living in the harsh, oxygen-deprived Qinghai-Tibetan Plateau and Hindu-Kush Himalayan Mountains. Although a yak genome was sequenced in 2012, the assembly is quite fragmented due to the limitations of Illumina sequencing technology. An accurate and complete reference genome is critical for studying the genetic variation of a species. Long-read sequences are more complete than short-read ones, and they have been successfully used for high-quality genome assembly in several species. Here, we present a high-quality assembly of the yak genome (PB_v1.0) at chromosome scale, which was constructed using long-read sequencing technology assisted by chromatin interaction technology. Compared to the previous yak genome assembly (BosGru_v2.0), the PB_v1.0 assembly has substantially improved chromosome sequence continuity, minimized repetitive structure ambiguity, and achieved gene model completeness. To intensively characterize the genetic variation of yak, we generated de novo genome assemblies based on Illumina short reads of seven recognized domestic yak breeds from Tibet and Sichuan as well as one wild yak from Hoh Xil. By comparing these eight assemblies to the PB_v1.0 genome, we obtained a comprehensive map of yak genetic diversity at the whole-genome level and identified a few protein-coding genes that were absent from the PB_v1.0 assembly. Although wild yak suffered a bottleneck effect, the genetic diversity of wild yak is still higher than that of domestic yak. By whole-genome alignment, we identified breed-specific sequences and genes, which will help in identifying yak breeds.

    Methods: High-quality DNA was extracted from the peripheral blood of a female yak in Riwoqe County, Tibet. SMRT sequencing libraries were constructed with a Blood & Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany). A total of 142 SMRT cells generated 184.6 Gbp of subread bases with a mean read length of 9.5 kbp on a PacBio RS II instrument (Pacific Biosciences, Menlo Park, CA, USA). The Falcon (v. 0.5.0) pipeline was used for the initial assembly. The first step was to identify all overlaps in the raw reads. Then, the read errors were corrected by leveraging the overlap information. The second step was to detect overlaps in the corrected reads. This step required no consensus calling. The final step was to generate the string graph assembly and the contig sequence output in FASTA format. To improve the quality of the initial assembly, 113.34 Gbp of Illumina short reads were generated from the same individual. Using Pilon (v1.23) [8], 845,002 homozygous insertions, 166,908 deletions, and 2,355,196 substitutions were identified and corrected. DNA from the same individual used in the PacBio sequencing was extracted and processed according to BioNano Genomics guidelines. The raw data were assembled with the BioNano Solve (v. 3.1.00) assembly pipeline (BioNano Genomics, San Diego, CA, USA). The combination of this assembly with the initial one yielded a superior assembly with a scaffold N50 of 65.67 Mbp and a maximum scaffold length of 128.62 Mbp. Hi-C libraries were created from yak whole-blood cells; 2–5 million cells were cross-linked and digested with the restriction enzyme HindIII. The sticky ends of all fragments were biotinylated, ligated to each other to form chimeric circles, enriched, sheared, and processed into sequencing libraries wherein the individual templates were chimeras of the physically associated DNA molecules from the original cross-linking. Hi-C reads were generated on an Illumina sequencing platform. The paired-end reads were uniquely mapped onto the Bionano assembly, classified into 30 groups using 3d-DNA (20180922) as the final assembly, and referred to as PB_v1.0. The exact locations of each scaffold in the 30 groups were based on the collinearity between yak and cattle (UMD3.1.1).

    Seven domestic yak breeds and one wild yak were selected for whole-genome sequencing and assembly. DNA was extracted from the ears of the Tibetan breeds, the blood of the Sichuan breeds, and the skin of the wild yak from Kunlun Spring, Hoh Xil. A whole-genome shotgun strategy and next-generation sequencing (NGS) technologies were run on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). Each genome was sequenced with a combination of short-insert (180 bp and 500 bp) and long-insert (2 kbp and 5 kbp) DNA libraries. SOAPdenovo (v2.04) was used to assemble each genome.

Cite
(2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430

3D-Genomics Database

RRID:SCR_007430, nif-0000-00553, 3D-Genomics Database (RRID:SCR_007430), 3D-GENOMICS

Explore at:
9 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 29, 2022
Description

THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C.briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E.cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G.theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N.crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P.falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.
