100+ datasets found

3D-Genomics Database
dknet.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
Unique identifier
https://identifiers.org/RRID:SCR_007430
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways; the help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project called e-Protein, which brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C. briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruit fly genome; Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G. theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N. crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P. falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the budding yeast genome; Schizosaccharomyces pombe, protein sequences from the fission yeast genome.
Data from: Repeat elements organise 3D genome structure and mediate...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Oct 24, 2018
Cite
Berry, Daniel; Winter, David J.; Dupont, Pierre-Yves; Young, Carolyn A.; Ganley, Austen R. D.; Cox, Murray P.; Scott, Barry; Ram, Arvina; Schardl, Christopher L.; Liachko, Ivan (2018). Repeat elements organise 3D genome structure and mediate transcription in the filamentous fungus Epichloë festucae [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000680242
Dataset updated
Oct 24, 2018
Authors
Berry, Daniel; Winter, David J.; Dupont, Pierre-Yves; Young, Carolyn A.; Ganley, Austen R. D.; Cox, Murray P.; Scott, Barry; Ram, Arvina; Schardl, Christopher L.; Liachko, Ivan
Description
Structural features of genomes, including the three-dimensional arrangement of DNA in the nucleus, are increasingly seen as key contributors to the regulation of gene expression. However, studies on how genome structure and nuclear organisation influence transcription have so far been limited to a handful of model species. This narrow focus limits our ability to draw general conclusions about the ways in which three-dimensional structures are encoded, and to integrate information from three-dimensional data to address a broader gamut of biological questions. Here, we generate a complete and gapless genome sequence for the filamentous fungus, Epichloë festucae. We use Hi-C data to examine the three-dimensional organisation of the genome, and RNA-seq data to investigate how Epichloë genome structure contributes to the suite of transcriptional changes needed to maintain symbiotic relationships with the grass host. Our results reveal a genome in which very repeat-rich blocks of DNA with discrete boundaries are interspersed by gene-rich sequences that are almost repeat-free. In contrast to other species reported to date, the three-dimensional structure of the genome is anchored by these repeat blocks, which act to isolate transcription in neighbouring gene-rich regions. Genes that are differentially expressed in planta are enriched near the boundaries of these repeat-rich blocks, suggesting that their three-dimensional orientation partly encodes and regulates the symbiotic relationship formed by this organism.
Machine learning reveals the diversity of human 3D chromatin contact...
zenodo.org
bin, csv, txt, zip
Updated Oct 11, 2024
Cite
Erin Gilbertson (2024). Machine learning reveals the diversity of human 3D chromatin contact patterns (example predictions genome wide) [Dataset]. http://doi.org/10.5281/zenodo.13900918
Available download formats: zip, txt, bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.13900918
Dataset updated
Oct 11, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Erin Gilbertson
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data for the paper: Machine learning reveals the diversity of human 3D chromatin contact patterns

GitHub: https://github.com/erin-n-gilbertson/3DGenome-diversity/tree/main

bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.22.573104v1.full

Manuscript accepted at Molecular Biology and Evolution

Of primary interest will be the example predictions genome wide for hg38 reference, human-archaic hominin ancestor and most divergent 1KG individual per genome along with the Jupyter notebook tutorial for making your own Akita predictions given any input 1MB sequence.

bin: contains the Python script and the qsub array shell script for generating example predictions. These scripts can be modified to take any FASTA files as input.

akita_predictions: contains both Akita prediction output arrays and SVG files with predicted contact maps for the hg38 reference, human-archaic hominin ancestor and most divergent 1KG individual in each of 4,873 1MB windows

anc_window_spearman.csv: Spearman correlation between each 1KG individual and the ancestor for each 1MB window. To calculate 3D divergence, subtract these values from 1.

basenji: the basenji directory from its GitHub repository, which must be present in the working directory to run predictions - https://github.com/calico/basenji/tree/master

genomes: fasta genomes for hg38 reference and human-archaic hominin ancestor used to make akita predictions

divergent_windows: variants and expected divergence distributions for the 392 windows that are more divergent than expected. These are defined in the manuscript as windows where 3D divergence between 1KG individuals and the ancestor is greater than what would be expected based on sequence divergence. See manuscript Fig. S9 for more details.

windows.txt: 4,873 1MB genomic windows with 100% coverage in hg38 used for Akita predictions

making_examples.ipynb: Jupyter notebook with tutorial instructions for making Akita predictions on any human genome sequence.
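
The 3D divergence described for anc_window_spearman.csv is one minus a Spearman correlation. The sketch below shows that calculation in plain Python for two flattened contact maps. The function names and the toy vectors are illustrative assumptions; they are not taken from the dataset or from the authors' scripts, and no particular CSV column layout is assumed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (vx * vy) ** 0.5


def divergence_3d(map_a, map_b):
    """3D divergence as described above: 1 minus the Spearman correlation."""
    return 1.0 - spearman(map_a, map_b)


# Toy example with made-up contact values (not real Akita output).
individual = [0.1, 0.4, 0.3, 0.9]
ancestor = [0.2, 0.5, 0.3, 0.8]
print(divergence_3d(individual, ancestor))
```

Identical rank orderings give a divergence of 0, and fully reversed orderings give 2.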
Data from: 3D genomics across the tree of life identifies condensin II as a...
search.dataone.org
dataverse.harvard.edu
Updated Nov 20, 2023
Cite
Hoencamp, Claire (2023). Data from: 3D genomics across the tree of life identifies condensin II as a determinant of architecture type [Dataset]. http://doi.org/10.7910/DVN/UROKAG
Unique identifier
https://doi.org/10.7910/DVN/UROKAG
Dataset updated
Nov 20, 2023
Dataset provided by
Harvard Dataverse
Authors
Hoencamp, Claire
Description
We analyzed conservation of the condensin II complex subunits in 24 species across the tree of life with a multistep BLAST approach. The data found here are the BLAST alignments for these searches. The first searches were conducted in October/November 2019 and were manually double-checked in February and March 2020. Searches for other organisms were conducted in June 2020. All alignments were posted in: Our approach was based on a search strategy used in earlier work by King et al. (https://doi.org/10.1093/molbev/msz140). We started by collecting publicly available protein sequences of the condensin I and II complex subunits of four diverse species from UniProt: Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. As a positive control we searched for SMC2 and SMC4, and the condensin I subunits, which are thought to be essential in all species. In the first alignment step, we used tblastn to search with the translated protein sequences of the above species against the nucleotide collection (nr/nt) database of the target species. The Expect threshold was set at 0.05. We reported an alignment as a hit when it had an E-value of 1E-10 or less with multiple regions of alignment. If there was an alignment with less confidence, we did an extra validation step to confirm the alignment. This step entailed downloading the translated nucleotide sequence of the putative alignment and using tblastn to search against the genome of a closely related organism with an annotated genome. If this search yielded the putative protein we used as bait, we considered the hit validated. In the second alignment step we used the same approach, but we blasted against the wgs database of the target species. We again used 1E-10 as the E-value cut-off. In the third step, only a few organisms still had missing subunits. 
To make an extra effort to find these subunits, we used the corresponding subunits of the nearest neighbour, which we identified in step 1 or 2, as bait. As the identified subunits were all nucleotide sequences, we used tblastx to translate these query sequences to protein sequences and blast against a translated nucleotide database. In this step we searched both the nr/nt database and the wgs database. As we were able to identify all SMC2/4 subunits but still missed condensin II subunits, we are now fairly sure these organisms indeed lack these condensin II subunits. However, it is still possible that these organisms do have all condensin II subunits, but with very low sequence conservation. We were also able to identify the condensin I subunits in almost all species, with two notable exceptions (see Table S4). The Arctic lamprey lacked condensin I subunits CAPG and CAPD2. Because we were able to identify all condensin II subunits in this organism, we still included this species in our analysis. The other exception is the tardigrade. In this species we identified SMC2 and SMC4, but could not identify any of the accessory subunits of condensin I or II. There are multiple possible explanations for this. On the one hand, it might have a biological explanation: for example, condensin’s accessory subunits may have evolved beyond recognition by our methods in this organism, or this species may indeed have lost both condensin I and II. On the other hand, the missing subunits may be explained by a technical issue, e.g. the quality of the databases. Therefore we cannot conclude with full certainty that condensin II is indeed missing in the tardigrade, and this will need to be investigated further.
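
The hit-calling rule in the description (Expect threshold 0.05, confident hit at an E-value of 1E-10 or less with multiple aligned regions, otherwise an extra validation step) can be sketched as a small decision function. This is only an illustration of the stated thresholds: the function name, the category labels and the example values are assumptions, not the authors' code or real BLAST output.

```python
EXPECT_THRESHOLD = 0.05  # Expect threshold quoted in the description
HIT_EVALUE = 1e-10       # E-value cut-off quoted for a confident hit


def classify_alignment(evalue, n_aligned_regions):
    """Sort one alignment into a hypothetical category.

    'hit'              : E-value <= 1E-10 and more than one aligned region
    'needs_validation' : within the Expect threshold but less confident
    'not_reported'     : E-value above the Expect threshold
    """
    if evalue > EXPECT_THRESHOLD:
        return "not_reported"
    if evalue <= HIT_EVALUE and n_aligned_regions > 1:
        return "hit"
    return "needs_validation"


# Made-up (E-value, aligned regions) pairs for illustration only.
examples = [(1e-35, 4), (1e-35, 1), (2e-4, 3), (0.3, 2)]
for evalue, regions in examples:
    print(evalue, regions, classify_alignment(evalue, regions))
```

Alignments labelled `needs_validation` here correspond to the cases that the description sends to the reciprocal tblastn check against a closely related annotated genome.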
Data from: In situ genome sequencing resolves DNA sequence and structure in...
idr.openmicroscopy.org
Updated Dec 31, 2020
Cite
(2020). In situ genome sequencing resolves DNA sequence and structure in intact biological samples [Dataset]. https://idr.openmicroscopy.org/study/idr0101/
Dataset updated
Dec 31, 2020
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Understanding genome organization requires integration of DNA sequence and 3D spatial context; however, existing genome-wide methods lack either base-pair sequence resolution or direct spatial localization. Here, we describe in situ genome sequencing (IGS), a method for simultaneously sequencing and imaging genomes within intact biological samples. We applied IGS to human fibroblasts and early mouse embryos, spatially localizing thousands of genomic loci in individual nuclei. Using these data, we characterized parent-specific changes in genome structure across embryonic stages, revealed single-cell chromatin domains in zygotes, and uncovered epigenetic memory of global chromosome positioning within individual embryos. These results demonstrate how in situ genome sequencing can directly connect sequence and structure across length scales from single base pairs to whole organisms.
Gene3D
rrid.site
scicrunch.org
Updated Aug 23, 2025
Cite
(2025). Gene3D [Dataset]. http://identifiers.org/RRID:SCR_007672
Unique identifier
https://identifiers.org/RRID:SCR_007672
Dataset updated
Aug 23, 2025
Description
A large database of CATH protein domain assignments for ENSEMBL genomes and UniProt sequences. Gene3D is a resource for studying proteins and their component domains. Gene3D takes CATH domains from Protein Data Bank (PDB) structures and assigns them to the millions of protein sequences with no PDB structures using hidden Markov models. Assigning a CATH superfamily to a region of a protein sequence gives information on the gross 3D structure of that region of the protein. CATH superfamilies have a limited set of functions, so the domain assignment provides some functional insights. Furthermore, most proteins have several different domains in a specific order, so looking for proteins with a similar domain organization provides further functional insights. Strict confidence cut-offs are used to ensure the reliability of the domain assignments. Gene3D imports functional information from sources such as UniProt and KEGG. Experimental datasets are also imported on request to help researchers integrate their data with the corpus of the literature. The website allows users to view descriptions for single proteins and genes as well as large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation, or associated functions and interactions can be used to expand explorations to new proteins. The Gene3D web services provide programmatic access to the CATH-Gene3D annotation resources and in-house software tools. These services include Gene3DScan for identifying structural domains within protein sequences, access to pre-calculated annotations for the major sequence databases, and linked functional annotation from UniProt, GO and KEGG.
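
The idea of finding proteins with a similar domain organization can be illustrated by grouping proteins on the ordered list of superfamilies assigned along their sequence. The sketch below is a toy illustration only: the protein names, superfamily labels and data layout are hypothetical, and it does not use or reproduce the Gene3D web services.

```python
from collections import defaultdict


def group_by_architecture(assignments):
    """Group proteins by their ordered domain architecture.

    assignments maps a protein name to a list of (start_position,
    superfamily_label) pairs; the architecture is the tuple of labels
    sorted by start position along the sequence.
    """
    groups = defaultdict(list)
    for protein, domains in assignments.items():
        architecture = tuple(label for _, label in sorted(domains))
        groups[architecture].append(protein)
    return dict(groups)


# Hypothetical assignments; the labels are placeholders, not real CATH codes.
toy_assignments = {
    "protA": [(10, "sf1"), (200, "sf2")],
    "protB": [(150, "sf2"), (5, "sf1")],   # same order as protA once sorted
    "protC": [(1, "sf2"), (100, "sf1")],   # same domains, reversed order
}

for architecture, proteins in group_by_architecture(toy_assignments).items():
    print(" > ".join(architecture), proteins)
```

Keeping the architecture as an ordered tuple, rather than a set, preserves the point made above that domain order matters: protC shares both domains with protA but falls in a different group.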
Orca: Sequence-based modeling of genome 3D architecture from kilobase to...
zenodo.org
application/gzip
Updated Mar 20, 2021
Cite
Jian Zhou (2021). Orca: Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale (Part2) [Dataset]. http://doi.org/10.5281/zenodo.4594676
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4594676
Dataset updated
Mar 20, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jian Zhou
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset (Part 2) provides the additional chromatin track files required for using the chromatin track plotting functions of Orca. Orca is a sequence-based deep learning modeling framework for multiscale genome 3D architecture.
CYGD - Comprehensive Yeast Genome Database
dknet.org
test2.scicrunch.org
Updated Aug 18, 2024
Cite
(2024). CYGD - Comprehensive Yeast Genome Database [Dataset]. http://identifiers.org/RRID:SCR_002289
Unique identifier
https://identifiers.org/RRID:SCR_002289
Dataset updated
Aug 18, 2024
Description
The MIPS Comprehensive Yeast Genome Database (CYGD) aims to present information on the molecular structure and functional network of the entirely sequenced, well-studied model eukaryote, the budding yeast Saccharomyces cerevisiae. In addition, the data of various projects on related yeasts are used for comparative analysis.
GTOP - Genomes To Protein structures
neuinfo.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). GTOP - Genomes To Protein structures [Dataset]. http://identifiers.org/RRID:SCR_007698
Unique identifier
https://identifiers.org/RRID:SCR_007698
Dataset updated
Jan 29, 2022
Description
GTOP is a database consisting of data analyses of proteins identified by various genome projects. This database mainly uses sequence homology analyses and makes extensive use of information on three-dimensional structures. GTOP is built by the Laboratory of Gene-Product Informatics at the National Institute of Genetics. This research is supported by the Japan Science and Technology Corporation and Grants-in-Aid for Scientific Research (Genomes in category C) from the Ministry of Education, Science, Sports and Culture of Japan. The following methods are used. Prediction of 3D structure: sequence homology search of PDB, using REVERSE PSI-BLAST. Functional predictions (family classifications): sequence homology search of Swiss-Prot, a well-annotated sequence database, using BLAST. Other analyses: motif analysis (PROSITE), family classification (Pfam), prediction of transmembrane helix domains (SOSUI), prediction of coiled-coil regions (Multicoil) and repetitive sequence analysis (RepAlign).
Full genome and transcriptome sequence assembly of the non-model organism...
search.dataone.org
bco-dmo.org
Updated Dec 29, 2024
Cite
Crow White; Robert J. Toonen; Mark Christie; Jean Davidson; Paul Anderson; Benjamin Daniels; Andy Lee; Cataixa López (2024). Full genome and transcriptome sequence assembly of the non-model organism Kellet’s whelk, Kelletia kelletii [Dataset]. http://doi.org/10.26008/1912/bco-dmo.945292.1
Unique identifier
https://doi.org/10.26008/1912/bco-dmo.945292.1
Dataset updated
Dec 29, 2024
Dataset provided by
Biological and Chemical Oceanography Data Management Office (BCO-DMO)
Authors
Crow White; Robert J. Toonen; Mark Christie; Jean Davidson; Paul Anderson; Benjamin Daniels; Andy Lee; Cataixa López
Time period covered
Aug 28, 2019 - Jul 8, 2020
Description
Description of linked resources for this dataset; all links can be found in the related dataset section.

All code and parameters used for the bioinformatic analyses are available at https://github.com/bndaniel/Kellets-whelk-genome-assembly and are archived at Zenodo, doi:10.5281/zenodo.13274364

The Kelletia kelletii genome and transcriptome produced by this study have been deposited in Dryad, doi:10.5061/dryad.w0vt4b8zn

All raw sequence data, including the PacBio Sequel II, Nanopore MinION, and Illumina NovaSeq DNA sequencing, as well as the Illumina NovaSeq RNA sequencing, are deposited in the NCBI Sequence Read Archive (SRA) under PRJNA999368 and PRJNA1000198

VCF files for the all-SNPs and DEG-SNPs data sets are deposited in Dryad, doi:10.5061/dryad.qbzkh18s3

The scripts used in this project are hosted in the public repository https://github.com/ChristieLab/kellets_whelk_rnaseq and archived at Zenodo, doi:10.5281/zenodo.14187737
Orca: Sequence-based modeling of genome 3D architecture from kilobase to...
zenodo.org
application/gzip
Updated Feb 24, 2022
Cite
Jian Zhou (2022). Orca: Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale (Part1) [Dataset]. http://doi.org/10.5281/zenodo.4594207
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4594207
Dataset updated
Feb 24, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jian Zhou
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset (Part 1) provides the core resource files required for using the code of Orca, including models and the hg38 reference genome (resources_core.tar.gz), and the micro-C mcool files required for extracting the experimental observations (resources_mcools.tar.gz). Orca is a sequence-based deep learning modeling framework for multiscale genome 3D architecture.
CATH-Gene3D
ebi.ac.uk
Updated Oct 21, 2020
Cite
(2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
Dataset updated
Oct 21, 2020
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
GeneSpeed- A Database of Unigene Domain Organization
rrid.site
test2.scicrunch.org
Updated Aug 26, 2025
Cite
(2025). GeneSpeed- A Database of Unigene Domain Organization [Dataset]. http://identifiers.org/RRID:SCR_002779
Unique identifier
https://identifiers.org/RRID:SCR_002779
Dataset updated
Aug 26, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 16, 2013. Database and customized tools to study the PFAM protein domain content of the transcriptome for all expressed genes of Homo sapiens, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans tethered to both a genomics array repository database and a range of external information resources. GeneSpeed has merged information from several existing data sets including the Gene Ontology Consortium, InterPro, Pfam, Unigene, as well as micro-array datasets. GeneSpeed is a database of PFAM domain homology contained within Unigene. Because Unigene is a non-redundant dbEST database, this provides a wide encompassing overview of the domain content of the expressed transcriptome. We have structured the GeneSpeed Database to include a rich toolset allowing the investigator to study all domain homology, no matter how remote. As a result, homology cutoff score decisions are determined by the scientist, not by a computer algorithm. This quality is one of the novel defining features of the GeneSpeed database giving the user complete control of database content. In addition to a domain content toolset, GeneSpeed provides an assortment of links to external databases, a unique and manually curated Transcription Factor Classification list, as well as links to our newly evolving GeneSpeed BetaCell Database. GeneSpeed BetaCell is a micro-array depository combined with custom array analysis tools created with an emphasis around the meta analysis of developmental time series micro-array datasets and their significance in pancreatic beta cells.
MOESM5 of Highly efficient lipid production in the green alga Parachlorella...
springernature.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 1, 2023
Cite
Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano (2023). MOESM5 of Highly efficient lipid production in the green alga Parachlorella kessleri: draft genome and transcriptome endorsed by whole-cell 3D ultrastructure [Dataset]. http://doi.org/10.6084/m9.figshare.10038743.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.10038743.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 9. P values of the present RNA-seq analysis.
MOESM3 of Highly efficient lipid production in the green alga Parachlorella...
springernature.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 1, 2023
Cite
Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano (2023). MOESM3 of Highly efficient lipid production in the green alga Parachlorella kessleri: draft genome and transcriptome endorsed by whole-cell 3D ultrastructure [Dataset]. http://doi.org/10.6084/m9.figshare.10038737.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.10038737.v1
Dataset updated
Jun 1, 2023
Dataset provided by
figshare
Authors
Shuhei Ota; Kenshiro Oshima; Tomokazu Yamazaki; Sangwan Kim; Zhe Yu; Mai Yoshihara; Kohei Takeda; Tsuyoshi Takeshita; Aiko Hirata; Kateřina Bišová; Vilém Zachleder; Masahira Hattori; Shigeyuki Kawano
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 7. RPKM and fold change of all Parachlorella transcripts.
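
For readers unfamiliar with the two quantities named in this file, the sketch below shows the standard RPKM definition (reads per kilobase of transcript per million mapped reads) and a simple fold change between two conditions. The function names, the pseudocount and the numbers are illustrative assumptions; the listing does not state how the study itself normalized counts or computed fold change.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def fold_change(rpkm_condition, rpkm_reference, pseudocount=1.0):
    """Ratio of two RPKM values.

    The pseudocount is an assumption added here to avoid division by
    zero for unexpressed transcripts; it is not taken from the study.
    """
    return (rpkm_condition + pseudocount) / (rpkm_reference + pseudocount)


# Made-up numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
print(fold_change(value, 12.0))
```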
Nurminen et al ("GP2Men") Study Primary and Metastatic Prostate Cancer Whole...
ega-archive.org
Updated Oct 4, 2023
Cite
(2023). Nurminen et al ("GP2Men") Study Primary and Metastatic Prostate Cancer Whole Genome Sequence Data [Dataset]. https://ega-archive.org/datasets/EGAD50000000005
Dataset updated
Oct 4, 2023
License
https://ega-archive.org/dacs/EGAC00001001309
Description
We used novel processing techniques to obtain whole genome data together with 3D anatomic and histomorphologic analysis in two men (GP5 and GP12) with high-risk prostate cancer (PrCa) undergoing radical prostatectomy. A total of 22 whole genome-sequenced sites (16 primary cancer foci and 6 lymph node metastases) were analyzed using evolutionary reconstruction tools and spatio-evolutionary models. Probability models were used to trace spatial and chronological origins of the primary tumor and metastases, chart their genetic drivers, and distinguish metastatic and non-metastatic subclones.
HUDSEN Human Gene Expression Spatial Database
rrid.site
dknet.org
Updated Jul 14, 2025
Cite
(2025). HUDSEN Human Gene Expression Spatial Database [Dataset]. http://identifiers.org/RRID:SCR_006325/resolver?q=*&i=rrid
Unique identifier
https://identifiers.org/RRID:SCR_006325 https://identifiers.org/RRID:SCR_006325/resolver?q=*&i=rrid
Dataset updated
Jul 14, 2025
Description
Database of a set of standard 3D virtual models at different stages of development from Carnegie Stages (CS) 12-23 (approximately 26-56 days post conception), in which various anatomical regions have been defined with a set of anatomical terms at various stages of development (known as an ontology). Experimental data is captured and converted to digital format and then mapped to the appropriate 3D model. The ontology is used to define sites of gene expression using a set of standard descriptions and to link the expression data to an "anatomical tree". Human data from stages CS12 to CS23 can be submitted to the HUDSEN Gene Expression Database. The anatomy ontology currently being used is based on the Edinburgh Human Developmental Anatomy Database, which encompasses all developing structures from CS1 to CS20 but is not detailed for developing brain structures. The ontology is being extended and refined (by Prof Luis Puelles, University of Murcia, Spain) and will be incorporated into the HUDSEN database as it is developed. Expression data is annotated using two methods to denote sites of expression in the embryo: spatial annotation and text annotation. Additionally, many aspects of the detection reagent and specimen are also annotated during this process (assignment of IDs, nucleotide sequences for probes etc). There are currently two main ways to search HUDSEN: using a gene/protein name or a named anatomical structure as the query term. The entire contents of the database can be browsed using the data browser. Results may be saved. The data in HUDSEN is generated both by researchers within the HUDSEN project and by the wider scientific community. 
The HUDSEN human gene expression spatial database is a collaboration between the Institute of Human Genetics in Newcastle, UK, and the MRC Human Genetics Unit in Edinburgh, UK, and was developed as part of the Electronic Atlas of the Developing Human Brain (EADHB) project (funded by the NIH Human Brain Project). The database is based on the Edinburgh Mouse Atlas gene expression database (EMAGE), and is designed to be an openly available resource to the research community holding gene expression patterns during early human development.
GTOP - Genomes To Protein structures
dknet.org
Updated Jan 29, 2022
Cite
(2022). GTOP - Genomes To Protein structures [Dataset]. http://identifiers.org/RRID:SCR_007698
Unique identifier
https://identifiers.org/RRID:SCR_007698 https://identifiers.org/RRID:SCR_007698/resolver?q=&i=rrid
Dataset updated
Jan 29, 2022
Description
GTOP is a database consisting of data analyses of proteins identified by various genome projects. This database mainly uses sequence homology analyses and makes extensive use of information on three-dimensional structures. GTOP is built by the Laboratory of Gene-Product Informatics at the National Institute of Genetics. This research is supported by the Japan Science and Technology Corporation and Grants-in-Aid for Scientific Research (Genomes in category C) from the Ministry of Education, Science, Sports and Culture of Japan. The following methods are used. Prediction of 3D structure: sequence homology search of PDB, using REVERSE PSI-BLAST. Functional predictions (family classifications): sequence homology search of Swiss-Prot, a well-annotated sequence database, using BLAST. Other analyses: motif analysis (PROSITE), family classification (Pfam), prediction of transmembrane helix domains (SOSUI), prediction of coiled-coil regions (Multicoil) and repetitive sequence analysis (RepAlign).
DNALongBench eQTL data
dataverse.harvard.edu
Updated Aug 10, 2025
Cite
DNALongBench_Author (2025). DNALongBench eQTL data [Dataset]. http://doi.org/10.7910/DVN/YUP2G5
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/YUP2G5
Dataset updated
Aug 10, 2025
Dataset provided by
Harvard Dataverse
Authors
DNALongBench_Author
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts in health and disease. However, effectively capturing the extensive long-range dependencies between DNA sequences, spanning millions of base pairs as seen in tasks such as three-dimensional (3D) chromatin folding, remains a significant challenge. Additionally, a comprehensive benchmark suite for evaluating tasks reliant on long-range dependencies is notably absent. To address this gap, we introduce DNALONGBENCH, a benchmark dataset spanning five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signal. To comprehensively assess DNALONGBENCH, we evaluate the performance of three baseline methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and a fine-tuned DNA foundation model, HyenaDNA. We envision DNALONGBENCH becoming a standardized resource that facilitates comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that consider long-range dependencies.
Data from: A chromosome-scale reference genome and genome-wide genetic...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2020
Cite
Jin-cheng Zhong; Qiu-mei Ji; Jin-wei Xin; Zhi-xin Chai; Cheng-fu Zhang; Yangla Dawa; Sang Luo; Qiang Zhang; Zhandui Pingcuo; Min-sheng Peng; Yong Zhu; Han-wen Cao; Hui Wang; Jian-lin Han (2020). A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak [Dataset]. http://doi.org/10.5061/dryad.jh9w0vt7x
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.jh9w0vt7x
Dataset updated
Oct 29, 2020
Dataset provided by
Kunming Institute of Zoology
Chinese Academy of Agricultural Sciences
Southwest Minzu University
State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, Lhasa, P.R. China
Authors
Jin-cheng Zhong; Qiu-mei Ji; Jin-wei Xin; Zhi-xin Chai; Cheng-fu Zhang; Yangla Dawa; Sang Luo; Qiang Zhang; Zhandui Pingcuo; Min-sheng Peng; Yong Zhu; Han-wen Cao; Hui Wang; Jian-lin Han
License
https://spdx.org/licenses/CC0-1.0.html
Description
Yak is an important livestock species for the people who live on the harsh, oxygen-deprived Qinghai-Tibetan Plateau and in the Hindu-Kush Himalayan Mountains. Although a yak genome was sequenced in 2012, the assembly is quite fragmented due to the limitations of Illumina sequencing technology. An accurate and complete reference genome is critical for studying the genetic variation of a species. Long-read sequences are more complete than short-read ones, and they have been successfully used for high-quality genome assembly in several species. Here, we present a high-quality assembly of the yak genome (PB_v1.0) at chromosome scale, which was constructed using long-read sequencing technology assisted by chromatin interaction technology. Compared to the previous yak genome assembly (BosGru_v2.0), the PB_v1.0 assembly has substantially improved chromosome sequence continuity, minimized repetitive structure ambiguity, and achieved gene model completeness. To intensively characterize genetic variation of yak, we generated de novo genome assemblies based on Illumina short reads of seven recognized domestic yak breeds from Tibet and Sichuan as well as one wild yak from Hoh Xil. By comparing these eight assemblies to the PB_v1.0 genome, we obtained a comprehensive map of yak genetic diversity at the whole-genome level and identified a few protein-coding genes that were absent from the PB_v1.0 assembly. Although wild yak suffered a bottleneck effect, the genetic diversity of wild yak is still higher than that of domestic yak. By whole-genome alignment, we identified breed-specific sequences and genes, which will help in the identification of yak breeds.

Methods: High-quality DNA was extracted from the peripheral blood of a female yak in Riwoqe County, Tibet. SMRT sequencing libraries were constructed with a Blood&Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany). A total of 142 SMRT cells generated 184.6 Gbp of subread bases with a mean read length of 9.5 kbp on a PacBio RS II instrument (Pacific Biosciences, Menlo Park, CA, USA). The Falcon (v. 0.5.0) pipeline was used for the initial assembly. The first step was to identify all overlaps in the raw reads. Then, read errors were corrected by leveraging the overlap information. The second step was to detect overlaps in the corrected reads; this step required no consensus calling. The final step was to generate the string graph assembly and output the contig sequences in FASTA format. To improve the quality of the initial assembly, 113.34 Gbp of Illumina short reads were generated from the same individual. Using Pilon (v1.23), 845,002 homozygous insertions, 166,908 deletions, and 2,355,196 substitutions were identified and corrected. DNA from the same individual used in the PacBio sequencing was extracted and processed according to BioNano Genomics guidelines. The raw data were assembled with the BioNano Solve (v. 3.1.00) assembly pipeline (BioNano Genomics, San Diego, CA, USA). Combining this assembly with the initial one yielded a superior assembly with a scaffold N50 of 65.67 Mbp and a maximum scaffold length of 128.62 Mbp. Hi-C libraries were created from yak whole-blood cells: 2–5 million cells were cross-linked and digested with the restriction enzyme HindIII. The sticky ends of all fragments were biotinylated, ligated to each other to form chimeric circles, enriched, sheared, and processed into sequencing libraries wherein the individual templates were chimeras of the physically associated DNA molecules from the original cross-linking. Hi-C reads were generated on an Illumina sequencing platform. The paired-end reads were uniquely mapped onto the Bionano assembly and classified into 30 groups using 3d-DNA (20180922); the result was taken as the final assembly and is referred to as PB_v1.0. The exact locations of each scaffold in the 30 groups were based on the collinearity between yak and cattle (UMD3.1.1).
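
The scaffold N50 quoted above is a standard contiguity statistic: the length L such that scaffolds of length L or longer together cover at least half of the assembly. The following is a minimal sketch of that calculation, not the authors' code. The function name `n50` and the toy scaffold lengths are assumptions made for illustration and are not taken from the yak assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty positive input


# Hypothetical scaffold lengths in bp (illustrative values only).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

In practice the lengths would be read from the assembly FASTA file; that step is left out here to keep the sketch self-contained.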

Seven domestic yak breeds and one wild yak were selected for whole-genome sequencing and assembly. DNA was extracted from the ears of the Tibetan breeds, the blood of the Sichuan breeds, and the skin of the wild yak from Kunlun Spring, Hoh Xil. A whole-genome shotgun strategy and next-generation sequencing (NGS) technologies were run on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). Each genome was sequenced with a combination of short-insert (180 bp and 500 bp) and long-insert (2 kbp and 5 kbp) DNA libraries. SOAPdenovo (v2.04) was used to assemble each genome.

Cite

(2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430

3D-Genomics Database

RRID:SCR_007430, nif-0000-00553, 3D-Genomics Database (RRID:SCR_007430), 3D-GENOMICS

Explore at:

9 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://identifiers.org/RRID:SCR_007430

Dataset updated

Jan 29, 2022

Description

THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is: "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C.briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E.cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G.theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N.crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P.falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.
