Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the 1000 Genomes Project, aligned with the GRCh38 genome build (https://www.internationalgenome.org/data-portal/data-collection/grch38). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts. The complete configuration file for each ProHap run is attached to this repository.
This data set contains six compressed directories, five representing the superpopulations included in the 1000 Genomes Project (https://catalog.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project), and one created using all the samples included in the 1000 Genomes data set:
AFR - African
AMR - American
EUR - European
SAS - South Asian
EAS - East Asian
ALL - all participants in the 1000 Genomes Project
Each of the directories contains the following files:
F1: The concatenated fasta file, ready to be used with search engines, containing the following:
Protein haplotype sequences obtained by ProHap, using alleles with at least 1 % frequency within the selected population
Reference proteome as per Ensembl v. 110
Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
The file is provided in two formats - full and simplified. The simplified fasta contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the simplified fasta file.
F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes
F3: Translations of haplotype cDNA sequences, before merging with the reference proteome
For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.
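As a quick way to inspect the concatenated fasta file (F1) before handing it to a search engine, the following is a minimal sketch of a generic fasta reader. It relies only on the general fasta convention (a header line starting with ">" followed by sequence lines); the record names in the example are hypothetical and do not reflect the actual ProHap header layout, which is described on the wiki page above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; real ProHap headers will differ.
example = """>prot_1 example_gene_A
MKT
AYI
>prot_2 example_gene_B
MSS
""".splitlines()

records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

For a real file, any iterable of text lines can be passed instead of the example list, for instance an open file handle, or `gzip.open(path, "rt")` if the fasta file itself is gzip-compressed.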
When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies our publication titled: "Single-nucleus RNA-seq and ATAC-seq in outbred rats with divergent cocaine addiction behaviors reveal long-term changes in gene regulation and GABAergic inhibition in the amygdala."
Files Included:
geno.N26.vcf.gz
pred_expr.Brain.N26.tsv
Behavioral data.xlsx
Additional Dataset Locations:
The primary datasets generated during this study can be found on the Gene Expression Omnibus under accession number GSE212417
Publicly Available Datasets Utilized:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for Project: "BIT: Bayesian Identification of Transcriptional Regulators from Epigenomics-Based Query Region Sets"
BIT package is available on GitHub: GitHub
We also provide an online web portal: BIT Portal
Please consult the manual for instructions on loading the reference data.
Please note that the preprocessed reference database must be loaded before running any function in BIT.
File Description:
hg38_200.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome hg38 with bin width 200.
hg38_500.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome hg38 with bin width 500.
hg38_1000.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome hg38 with bin width 1000.
mm10_200.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome mm10 with bin width 200.
mm10_500.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome mm10 with bin width 500.
mm10_1000.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome mm10 with bin width 1000.
Input_Data.tar.gz: contains the input data for the four application cases, including differentially accessible regions (DARs) from bulk and single-cell perturbation experiments, cancer-type-specific accessible regions, and cell-type-specific accessible regions.
Figure_Data_v2.tar.gz: The updated figure data folder, which includes the data used to generate the manuscript’s plots, as well as the output from the benchmarking methods.
Figure.R: R code to replicate the figures, used together with Figure_Data_v2.tar.gz.
DepMap data can be accessed through the DepMap Consortium: DepMap
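The reference archives listed above follow a `<genome>_<bin width>.tar.gz` naming pattern. The sketch below shows how such a name could be built and how a query region would map onto bins of a given width. It assumes fixed-width, non-overlapping bins counted from position 0 of each chromosome and half-open regions; this is an illustrative assumption about the binning, not something taken from the BIT manual, and the helper names are hypothetical.

```python
# Genomes and bin widths exactly as they appear in the file list above.
GENOMES = ("hg38", "mm10")
BIN_WIDTHS = (200, 500, 1000)


def reference_archive(genome: str, bin_width: int) -> str:
    """Return the archive name for a genome/bin-width pair from the file list."""
    if genome not in GENOMES or bin_width not in BIN_WIDTHS:
        raise ValueError(f"no reference archive for {genome} with bin width {bin_width}")
    return f"{genome}_{bin_width}.tar.gz"


def region_bins(start: int, end: int, bin_width: int) -> range:
    """Indices of the bins overlapped by the half-open region [start, end).

    Assumption: bin i covers [i * bin_width, (i + 1) * bin_width).
    """
    if end <= start:
        raise ValueError("region must have positive length")
    return range(start // bin_width, (end - 1) // bin_width + 1)


# A hypothetical 300 bp region touches fewer bins as the bin width grows.
for width in BIN_WIDTHS:
    print(reference_archive("hg38", width), list(region_bins(950, 1250, width)))
```

Under these assumptions, a smaller bin width gives finer resolution at the cost of more bins per region, which is the trade-off behind offering three widths per genome.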
MicrobesOnline is designed specifically to facilitate comparative studies on prokaryotic genomes. It is an entry point for operon, regulon, cis-regulatory and network predictions based on comparative analysis of genomes. The portal includes over 1000 complete genomes of bacteria, archaea and fungi and thousands of expression microarrays from diverse organisms ranging from model organisms such as Escherichia coli and Saccharomyces cerevisiae to environmental microbes such as Desulfovibrio vulgaris and Shewanella oneidensis. To assist in annotating genes and in reconstructing their evolutionary history, MicrobesOnline includes a comparative genome browser based on phylogenetic trees for every gene family as well as a species tree. To identify co-regulated genes, MicrobesOnline can search for genes based on their expression profile, and provides tools for identifying regulatory motifs and seeing if they are conserved. MicrobesOnline also includes fast phylogenetic profile searches, comparative views of metabolic pathways, operon predictions, a workbench for sequence analysis and integration with RegTransBase and other microbial genome resources. The next update of MicrobesOnline will contain significant new functionality, including comparative analysis of metagenomic sequence data. Programmatic access to the database, along with source code and documentation, is available at http://microbesonline.org/programmers.html.
The test files used for the mosaicism-nextflow pipeline, owned by CIBERER pipelines, are processed files. The raw data originates from sample NA12878 from the GIAB project, accessible at https://www.internationalgenome.org/data-portal/search?q=NA12878.
The specific region selected for analysis is: 17:7577873-7580187.
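A region string in this `chromosome:start-end` form can be split into its parts before being passed to downstream tools. The following is a minimal sketch; it assumes the coordinates are 1-based and inclusive, which the text above does not state, and the `Region` type and `parse_region` helper are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: 1-based, inclusive coordinates.
        return self.end - self.start + 1


_REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string into a Region."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return Region(chrom, start, end)


region = parse_region("17:7577873-7580187")
print(region.chrom, region.start, region.end, region.length)
```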
The test dataset for CNV (test_set.tar.gz), generated in silico by VISOR (https://doi.org/10.1093/bioinformatics/btz719), consists of 8 combinations of 4 different haplotypes derived from chromosome 22 (GRCh38), at an average coverage of 30x.
A small test set for CNVs (CNV_test_set.tar.gz) consists of 10 samples from the 1000 Genomes Project (hs37d5), each of them with a deletion in one exon of BRCA1.
https://research.csiro.au/dap/licences/csiro-data-licence/
Whole genome sequencing was performed on Gamarada debralockiae gen. nov. sp. nov. Following sequence assemblies, contigs greater than 1,000 nucleotides were subjected to gene annotation and analysis. We provide here the remaining contigs which were less than 1,000 nucleotides in length.
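The length-based split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual procedure: contigs are held in a dictionary keyed by hypothetical names, and a contig of exactly 1,000 nucleotides is placed in the short set because the text only says that contigs greater than 1,000 nucleotides were annotated.

```python
from typing import Dict, Tuple


def split_by_length(
    contigs: Dict[str, str], threshold: int = 1000
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split contigs into (longer than threshold, threshold or shorter)."""
    long_contigs = {name: seq for name, seq in contigs.items() if len(seq) > threshold}
    short_contigs = {name: seq for name, seq in contigs.items() if len(seq) <= threshold}
    return long_contigs, short_contigs


# Hypothetical contigs with synthetic sequences of known length.
contigs = {
    "contig_a": "A" * 1500,
    "contig_b": "C" * 1000,
    "contig_c": "G" * 250,
}

long_contigs, short_contigs = split_by_length(contigs)
print(sorted(long_contigs), sorted(short_contigs))
```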
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically