Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the 1000 Genomes Project, aligned with the GRCh38 genome build (https://www.internationalgenome.org/data-portal/data-collection/grch38). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts. The complete configuration file for each ProHap run is attached to this repository.
This data set contains six compressed directories, five representing the superpopulations included in the 1000 Genomes Project (https://catalog.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project), and one created using all the samples included in the 1000 Genomes data set:
AFR - African
AMR - American
EUR - European
SAS - South Asian
EAS - East Asian
ALL - all participants in the 1000 Genomes Project
Each of the directories contains the following files:
F1: The concatenated fasta file ready to be used with search engines, contains the following:
Protein haplotype sequences obtained by ProHap, using alleles with at least 1 % frequency within the selected population
Reference proteome as per Ensembl v. 110
Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
The file is provided in two formats - full and simplified. The simplified fasta contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the simplified fasta file.
F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes
F3: Translations of haplotype cDNA sequences, before merging with the reference proteome
For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.
When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0
The test files used for the mosaicism-nextflow pipeline, owned by CIBERER pipelines, are processed files. The raw data originates from sample NA12878 from the GIAB project, accessible at https://www.internationalgenome.org/data-portal/search?q=NA12878.
The specific region selected for analysis is: 17:7577873-7580187.
The test dataset for CNV (test_set.tar.gz), generated in silico by VISOR (https://doi.org/10.1093/bioinformatics/btz719), consists of 8 combinations of 4 different haplotypes derived from chromosome 22 (GRCh38), at an average coverage of 30x.
A small test set for CNVs (CNV_test_set.tar.gz) consists of 10 samples from the 1000 Genomes Project (hs37d5), each with a deletion in one exon of BRCA1.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data related to Naval-Sanchez et al. 2019 "Selection Signatures in heat-resistant cattle reveal missense mutations in damage response gene HELB". Samples: Whole-genome sequences from the 1000 Bull Genomes Project (Run6, Bos taurus, and Bos indicus) for breeds chosen as a reference for imputation were retrieved (Daetwyler et al. 2014; Hayes and Daetwyler 2018). This resulted in 440 whole-genome sequences across 18 cattle breeds. Breeds were grouped according to their phenotypes and reported genomic crosses as taurine (humpless), indicine (with hump), admixed or African sanga, the two latter being stabilized composite breeds (Rege J 1999; Hanotte et al. 2002; Rege J et al. 2007; Mwai et al. 2015; Felius, Marleen et al. 2016). The dataset contains 186 European taurine, 102 Asiatic indicine and 80 cross-bred genomes, as well as a set of African samples composed of 12 taurine, 41 sanga and 19 indicine.
Lineage: Mapping, variant detection and imputation: Genetic variants from the sequenced animals were extracted and filtered to keep only bi-allelic variants with at least four copies of the minor allele. Genotypes of the filtered variants were phased using Eagle (Loh et al. 2016) and imputed using FImpute 2.2 (Sargolzaei et al. 2014). The analysis detected 39,679,303 high-quality SNPs, of which 24,080,747 were considered common SNPs (MAF >= 0.05).
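The two thresholds above (at least four copies of the minor allele to keep a variant, MAF >= 0.05 to call it common) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, and it assumes diploid animals with genotypes coded as the number of alternate alleles (0, 1 or 2) at a bi-allelic site.

```python
from typing import Sequence, Tuple


def minor_allele_stats(alt_counts: Sequence[int], ploidy: int = 2) -> Tuple[int, float]:
    """Return (minor allele copies, minor allele frequency) for one bi-allelic site.

    alt_counts holds the number of alternate alleles carried by each animal
    (assumed coding: 0, 1 or 2 for diploids, no missing calls).
    """
    total_alleles = ploidy * len(alt_counts)
    alt = sum(alt_counts)
    minor = min(alt, total_alleles - alt)
    return minor, minor / total_alleles


def keep_variant(alt_counts: Sequence[int], min_minor_copies: int = 4) -> bool:
    """Keep a site only if the minor allele is seen at least min_minor_copies times."""
    minor, _ = minor_allele_stats(alt_counts)
    return minor >= min_minor_copies


def is_common(alt_counts: Sequence[int], maf_threshold: float = 0.05) -> bool:
    """Flag a site as a common SNP when its MAF reaches the threshold."""
    _, maf = minor_allele_stats(alt_counts)
    return maf >= maf_threshold


if __name__ == "__main__":
    # 440 animals, four heterozygotes: passes the copy filter but is not common.
    rare_site = [1] * 4 + [0] * 436
    print(keep_variant(rare_site), is_common(rare_site))
```

With 440 animals there are 880 alleles per site, so four minor-allele copies correspond to a MAF of roughly 0.0045, well below the 0.05 cut-off used for common SNPs.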
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ATAC-seq data from Kumasaka et al, 2018 was processed with the nf-core/atacseq v2.1.2 pipeline using Nextflow v23.09.3. We aligned raw ATAC-seq reads to the GRCh38 reference genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa downloaded from Ensembl) with BWA v0.7.17. We called broad peaks with MACS2 v2.2.7.1 and defined consensus peaks as the union of all peaks that were present in at least 5% of the samples. We then quantified read overlaps with the set of consensus peaks with featureCounts v2.0.1. Finally, we normalised the read counts (counts per million) and then used the inverse normal transformation to standardise the data distribution.
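The final normalisation step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used for this dataset: the function names are hypothetical, counts per million are assumed to be computed per sample, and the inverse normal transformation is assumed to be the rank-based variant with a 0.5 offset and average ranks for ties (the description does not say which variant was used).

```python
from statistics import NormalDist


def counts_per_million(counts):
    """Scale one sample's peak counts so that they sum to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def inverse_normal_transform(values):
    """Rank-based inverse normal transform of one peak across samples.

    Ties receive their average rank; rank r maps to the standard normal
    quantile at (r - 0.5) / n (an assumed, commonly used offset).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    print(counts_per_million([1, 1, 2]))
    print(inverse_normal_transform([3.0, 1.0, 2.0]))
```

The transform depends only on ranks, so it is insensitive to the scale of the normalised counts and yields values that are symmetric around zero for each peak.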
Genotype data for the 91 overlapping samples were downloaded from the 1000 Genomes 30x on GRCh38 website. We then used the eQTL-Catalogue/qtlmap v24.01.1 workflow to perform chromatin accessibility QTL analysis. We set the cis window size to 200,000 bp and excluded peaks that had fewer than 25 variants within that window. More details of the association testing workflow can be found here.
The Cancer Cell Line Encyclopedia (CCLE) is a collaborative project between the Broad Institute and the Novartis Institutes for Biomedical Research and its Genomics Institute of the Novartis Research Foundation, with the goal of conducting a detailed genetic and pharmacologic characterization of a large panel of human cancer models. The CCLE also works to develop integrated computational analyses that link distinct pharmacologic vulnerabilities to genomic patterns and to translate cell line integrative genomics into cancer patient stratification. The CCLE provides public access to genomic data, analysis and visualization for about 1000 cell lines.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The OzOat diversity panel is a curated collection of approximately 319 oat accessions representing global genetic diversity and the breeding history of Australian oats. The panel spans material from historical landraces (circa 1892) to modern cultivars, including key international introductions and donor lines. Designed to maximise recombination and capture broad phenotypic diversity, the OzOat panel underpins gene discovery and marker development for a wide range of traits. High-density SNP genotyping, transcriptome sequencing, and curated pedigree information provide a powerful genome-to-phenome platform for accelerating genetic discovery and breeding innovation. Combined, these resources enable the development of oat varieties with improved resilience, productivity, and quality tailored to Australian farming systems. Links to genomic, pedigree, and marker–trait association datasets generated as part of the GRDC project: Optimising genetic control of oat phenology for Australia (CSP2007-002RTX) are provided within this DAP entry.
Lineage: GRDC project: Optimising genetic control of oat phenology for Australia CSP2007-002RTX supported the foundational work to establish the panel, concepts, and GWAS/TWAS pipelines underpinning the 'OzOat' panel. The types of data generated include:
SNP data:
- Genomic SNPs from the DArTSeq SNP platform
- SNP haplotype file combining DArTSeq SNPs and transcriptome SNPs, generated after removal of missing or poor-quality data (genotypes with >50% missing data removed, SNPs with >20% missing data removed). Monomorphic markers and those with a minor allele frequency less than 5% were also removed. File sorted by physical chromosome position of SNP in Oat Sang_v0 and Oat OT3098_v1 (PepsiCo) reference genomes.
Phenotype data:
1. Controlled environment (long day, long day with vernalisation, short day) conditions at Adelaide University (Scott Boden), for the complete OzOat panel.
2. Field experiments were conducted at Wagga Wagga, New South Wales (147.3°E, 35.1°S, elevation ~210 m) in 2021 and 2022. A subset of the OzOat panel was evaluated. In 2021, 80 oat genotypes were sown at two sowing dates (7 May and 2 June), and in 2022, 60 genotypes were sown at three sowing dates (14 April, 3 May, and 24 May).
Pedigree: The Helium Pedigree Visualisation Framework (Shaw et al. 2014) was utilised to view over 1000 international accessions spanning diversity relevant to the history of Australian oat breeding. From this list, 319 oat lines were selected to form the OzOat panel. A Helium-compatible csv file contains all known ancestors for these 1000 oat accessions; lines selected for inclusion in the OzOat panel are highlighted in green. Pedigrees were obtained from the “Pedigrees of Oat Lines” POOL database (Tinker and Deyl, 2005), from Fitzsimmons et al. (1983) and directly from oat breeders (Dr Pamela Zwer and Dr Bruce Winter, personal communication).
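The SNP quality filters listed under SNP data (genotypes with >50% missing data, SNPs with >20% missing data, monomorphic markers, and markers with a minor allele frequency below 5%) can be sketched in Python as below. This is an illustration only, not the project's pipeline: the function name is hypothetical, calls are assumed to be coded as 0, 1 or 2 alternate alleles with None for missing, and genotypes are assumed to be filtered before SNPs, an order the description does not state.

```python
def filter_snp_matrix(calls, max_sample_missing=0.5, max_snp_missing=0.2, min_maf=0.05):
    """Apply missingness and allele-frequency filters to a genotype matrix.

    calls maps each genotype (oat line) name to a list of SNP calls coded as
    0, 1, 2 or None (missing); this coding is an assumption of the sketch.
    Returns (kept genotype names, kept SNP column indices).
    """
    n_snps = len(next(iter(calls.values())))

    # Step 1: drop genotypes whose missing rate exceeds the threshold.
    kept_samples = [
        name
        for name, row in calls.items()
        if sum(call is None for call in row) / n_snps <= max_sample_missing
    ]

    # Step 2: drop SNPs on missingness, monomorphism and minor allele frequency.
    kept_snps = []
    for j in range(n_snps):
        column = [calls[name][j] for name in kept_samples]
        observed = [call for call in column if call is not None]
        if not observed:
            continue
        if len(column) - len(observed) > max_snp_missing * len(column):
            continue
        alt_freq = sum(observed) / (2 * len(observed))
        maf = min(alt_freq, 1 - alt_freq)
        if maf == 0 or maf < min_maf:
            continue
        kept_snps.append(j)
    return kept_samples, kept_snps


if __name__ == "__main__":
    example = {
        "line_A": [0, 2, 0, None],
        "line_B": [2, 2, 0, None],
        "line_C": [0, 2, None, 2],
        "line_D": [None, None, None, 0],
        "line_E": [2, 2, 0, 0],
    }
    print(filter_snp_matrix(example))
```

In the toy example, line_D is dropped for 75% missing calls, after which only the first SNP survives: the second is monomorphic and the other two exceed the 20% missingness limit.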
Fitzsimmons, R. W., Roberts, G. L., and Wrigley, C. W. (1983). Australian Oat varieties (Melbourne: CSIRO Publishing). doi: 10.1071/9780643105447
Shaw, P. D., Graham, M., Kennedy, J., Milne, I., and Marshall, D. F. (2014). Helium: visualization of large scale plant pedigrees. BMC Bioinf. 15, 259. doi: 10.1186/1471-2105-15-259
Tinker, N. A., and Deyl, J. K. (2005). A curated internet database of oat pedigrees. Crop Sci. 45, 2269–2272. doi: 10.2135/cropsci2004.0687
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information about the dataset files:
1) pancan_rnaseq_freeze.tsv.gz: Publicly available gene expression data for the TCGA Pan-cancer dataset. File: PanCanAtlas EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611] [https://doi.org/10.1016/j.celrep.2018.03.046]
2) pancan_mutation_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset. File: mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]
3) pancan_GISTIC_threshold.tsv.gz: Publicly available gene-level copy number information of the TCGA Pan-cancer dataset. This file is processed using script process_copynumber.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. The files copy_number_loss_status.tsv.gz and copy_number_gain_status.tsv.gz generated from this data are used as inputs in our Galaxy pipeline. [https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443] [https://doi.org/10.1016/j.celrep.2018.03.046]
4) mutation_burden_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset. File: mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/] [http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]
5) sample_freeze.tsv or sample_freeze_version4_modify.tsv: The file lists the frozen samples as determined by the TCGA PanCancer Atlas consortium along with raw RNAseq and mutation data. These were previously determined and included for all downstream analysis. All other datasets were processed and subset according to the frozen samples. [https://github.com/greenelab/pancancer/]
6) vogelstein_cancergenes.tsv: compendium of OG and TSG used for the analysis. [https://github.com/greenelab/pancancer/]
7) CCLE_DepMap_18Q1_maf_20180207.txt.gz: Publicly available Mutational data for CCLE cell lines from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2FCCLE_DepMap_18Q1_maf_20180207.txt]
8) ccle_rnaseq_genes_rpkm_20180929.gct.gz: Publicly available Expression data for 1019 cell lines (RPKM) from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2Fccle_2019%2FCCLE_RNAseq_genes_rpkm_20180929.gct.gz]
9) CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct: Publicly available merged Mutational and copy number alterations that include gene amplifications and deletions for the CCLE cell lines. This data is represented in the binary format and provided by the Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://data.broadinstitute.org/ccle_legacy_data/binary_calls_for_copy_number_and_mutation_data/CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct]
10) GDSC_cell_lines_EXP_CCLE_names.csv.gz: Publicly available RMA normalized expression data for Genomics of Drug Sensitivity in Cancer (GDSC) cell lines. File gdsc_cell_line_RMA_proc_basalExp.csv was downloaded. This data was subset to the 389 cell lines that are common to CCLE and GDSC. All the GDSC cell line names were replaced with CCLE cell line names for further processing. [https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip]
11) GDSC_CCLE_common_mut_cnv_binary.csv.gz: A subset of merged Mutational and copy number alterations that include gene amplifications and deletions for common cell lines between GDSC and CCLE. This file is generated using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct and a list of common cell lines.
12) gdsc1_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC1 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_15Oct19.xlsx]
13) gdsc2_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC2 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC2_fitted_dose_response_15Oct19.xlsx]
14) compounds.csv: list of pharmacological compounds tested for our analysis
15) tcga_dictonary.tsv: list of cancer types used in the analysis.
16) seg_based_scores.tsv: Measurement of total copy number burden, Percent of genome altered by copy number alterations. This file was used as part of the Pancancer analysis by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/]