7 datasets found
  1. Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 11, 2024
    Cite
    Vasicek, Jakub (2024). Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project data set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10149277
    Explore at:
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    University of Bergen
    Authors
    Vasicek, Jakub
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the 1000 Genomes Project, aligned with the GRCh38 genome build (https://www.internationalgenome.org/data-portal/data-collection/grch38). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts. The complete configuration file for each ProHap run is attached to this repository.

    This data set contains six compressed directories, five representing the superpopulations included in the 1000 Genomes Project (https://catalog.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project), and one created using all the samples included in the 1000 Genomes data set:

    AFR - African

    AMR - American

    EUR - European

    SAS - South Asian

    EAS - East Asian

    ALL - all participants in the 1000 Genomes Project

    Each of the directories contains the following files:

    F1: The concatenated fasta file ready to be used with search engines, contains the following:

    Protein haplotype sequences obtained by ProHap, using alleles with at least 1 % frequency within the selected population

    Reference proteome as per Ensembl v. 110

    Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)

    The file is provided in two formats - full and simplified. The simplified fasta contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the simplified fasta file.

    F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes

    F3: Translations of haplotype cDNA sequences, before merging with the reference proteome

    For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
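
    As a rough illustration of working with the concatenated fasta (F1), the sketch below reads FASTA records from a text handle using only the Python standard library. The record identifiers in the toy input are invented for the example and do not reflect the actual ProHap header layout, which is documented on the wiki page linked above.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input; the identifiers are hypothetical, not real ProHap accessions.
example = io.StringIO(">prot_1\nMKT\nAAL\n>prot_2\nMSS\n")
records = dict(read_fasta(example))
print(len(records), "sequences read")
```

    For a real file, the same function could be applied to a handle opened with `open(path)` or `gzip.open(path, "rt")`.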

    For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.

    When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0

  2. Data from: Test data for CIBERER nextflow pipelines

    • portalcientifico.unav.edu
    • zenodo.org
    Updated 2025
    Cite
    Ruiz-Arenas, Carlos; Sevilla-Porras, Marta; López López, Daniel; Benítez Quesada, Yolanda (2025). Test data for CIBERER nextflow pipelines [Dataset]. https://portalcientifico.unav.edu/documentos/67321e7daea56d4af048579b?lang=eu
    Explore at:
    Dataset updated
    2025
    Authors
    Ruiz-Arenas, Carlos; Sevilla-Porras, Marta; López López, Daniel; Benítez Quesada, Yolanda
    Description

    The test files used for the mosaicism-nextflow pipeline, owned by CIBERER pipelines, are processed files. The raw data originate from sample NA12878 from the GIAB project, accessible at https://www.internationalgenome.org/data-portal/search?q=NA12878.

    The specific region selected for analysis is: 17:7577873-7580187.
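
    A minimal sketch of how a region string in this `chrom:start-end` form could be parsed. The helper names are hypothetical, and the length calculation assumes 1-based inclusive coordinates, which the description does not state.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates (an assumption, not stated in the source).
        return self.end - self.start + 1


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string into a Region."""
    chrom, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))


region = parse_region("17:7577873-7580187")
print(region.chrom, region.start, region.end, region.length)
```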

    The test dataset for CNV (test_set.tar.gz), generated in silico by VISOR (https://doi.org/10.1093/bioinformatics/btz719), consists of 8 combinations of 4 different haplotypes derived from chromosome 22 (GRCh38), at an average coverage of 30x.

    A small test set for CNVs (CNV_test_set.tar.gz) consists of 10 samples from the 1000 Genomes Project (hs37d5), each with a deletion in one exon of BRCA1.

  3. Allele frequencies between Bos taurus and Bos indicus

    • data.csiro.au
    • researchdata.edu.au
    Updated Jun 17, 2019
    Cite
    Marina Naval Sanchez; Laercio Porto Neto; Hans Daetwyler; Ben Hayes; Toni Reverter-Gomez (2019). Allele frequencies between Bos taurus and Bos indicus [Dataset]. http://doi.org/10.25919/5ceb24e4ae2f8
    Explore at:
    Dataset updated
    Jun 17, 2019
    Dataset provided by
    CSIRO (https://www.csiro.au/)
    Authors
    Marina Naval Sanchez; Laercio Porto Neto; Hans Daetwyler; Ben Hayes; Toni Reverter-Gomez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Queensland Alliance for Agriculture and Food Innovation, The University of Queensland
    Agriculture Victoria, AgriBio, Centre for AgriBioscience
    School of Applied Systems Biology, La Trobe University
    CSIRO (https://www.csiro.au/)
    Description

    Data related to Naval-Sanchez et al. 2019 "Selection Signatures in heat-resistant cattle reveal missense mutations in damage response gene HELB". Samples: Whole-genome sequences from the 1000 Bull Genomes Project (Run6, Bos taurus and Bos indicus) for breeds chosen as a reference for imputation were retrieved (Daetwyler et al. 2014; Hayes and Daetwyler 2018). This resulted in 440 whole-genome sequences across 18 cattle breeds. Breeds were grouped according to their phenotypes and reported genomic crosses as taurine (humpless), indicine (with hump), admixed or African sanga, the latter two being stabilized composite breeds (Rege J 1999; Hanotte et al. 2002; Rege J et al. 2007; Mwai et al. 2015; Felius, Marleen et al. 2016). The dataset contains 186 European taurine, 102 Asiatic indicine and 80 cross-bred genomes, as well as a set of African samples composed of 12 taurine, 41 sanga and 19 indicine.

    Lineage: Mapping, variant detection and imputation: Genetic variants from the sequenced animals were extracted and filtered to keep only bi-allelic variants with at least four copies of the minor allele. Genomes of filtered variants were phased using Eagle (Loh et al. 2016) and imputed using FImpute 2.2 (Sargolzaei et al. 2014). The analysis detected 39,679,303 high-quality SNPs, of which 24,080,747 were considered common SNPs (MAF >= 0.05).
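
    The two thresholds described above (at least four copies of the minor allele, and MAF >= 0.05 for "common" SNPs) can be sketched as simple predicates on allele counts. This is an illustration only, not the authors' pipeline: the function names are hypothetical, and it assumes diploid animals with no missing calls, so 440 animals give 880 alleles per site.

```python
def minor_allele_stats(alt_count: int, n_alleles: int):
    """Return (minor allele count, minor allele frequency) for a bi-allelic site."""
    minor = min(alt_count, n_alleles - alt_count)
    return minor, minor / n_alleles


def keep_variant(alt_count: int, n_alleles: int, n_alt_alleles: int = 1,
                 min_minor_copies: int = 4) -> bool:
    """Keep only bi-allelic sites with enough copies of the minor allele."""
    if n_alt_alleles != 1:  # more than one alternate allele: not bi-allelic
        return False
    minor, _ = minor_allele_stats(alt_count, n_alleles)
    return minor >= min_minor_copies


def is_common(alt_count: int, n_alleles: int, maf_threshold: float = 0.05) -> bool:
    """Flag a site as common when its MAF reaches the threshold."""
    _, maf = minor_allele_stats(alt_count, n_alleles)
    return maf >= maf_threshold


N_ALLELES = 2 * 440  # assumes diploid genotypes for all 440 animals
for alt in (3, 4, 88, 876):
    print(alt, keep_variant(alt, N_ALLELES), is_common(alt, N_ALLELES))
```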

  4. Re-analysis of chromatin accessibility QTLs from the Kumasaka et al, 2018...

    • zenodo.org
    • datasetcatalog.nlm.nih.gov
    application/gzip
    Updated Sep 27, 2024
    Cite
    Kaur Alasoo; Kuningas Kristiina (2024). Re-analysis of chromatin accessibility QTLs from the Kumasaka et al, 2018 study [Dataset]. http://doi.org/10.5281/zenodo.13848268
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Sep 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaur Alasoo; Kuningas Kristiina
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ATAC-seq data from Kumasaka et al, 2018 was processed with the nf-core/atacseq v2.1.2 pipeline using Nextflow v23.09.3. We aligned raw ATAC-seq reads to the GRCh38 reference genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa downloaded from Ensembl) with BWA v0.7.17. We called broad peaks with MACS2 v2.2.7.1 and defined consensus peaks as the union of all peaks that were present in at least 5% of the samples. We then quantified read overlaps with the set of consensus peaks with featureCounts v2.0.1. Finally, we normalised the read counts (counts per million) and then used the inverse normal transformation to standardise the data distribution.
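
    The last two steps (counts per million, then an inverse normal transformation) can be sketched as follows. This is a hedged illustration rather than the workflow's actual code: it assumes CPM is computed per sample across peaks, that the transformation is rank-based with the (rank - 0.5) / n offset, and that ties receive their average rank. None of these details are stated in the description.

```python
from statistics import NormalDist


def counts_per_million(counts):
    """Scale one sample's read counts so that they sum to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def inverse_normal_transform(values):
    """Rank-based inverse normal transform; ties get their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    # Map ranks to quantiles strictly inside (0, 1), then to standard normal scores.
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


cpm = counts_per_million([1, 1, 2])
scores = inverse_normal_transform([10.0, 30.0, 20.0])
print(cpm)
print([round(s, 3) for s in scores])
```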

    Genotype data for the 91 overlapping samples were downloaded from the 1000 Genomes 30x on GRCh38 website. Finally, we used the eQTL-Catalogue/qtlmap v24.01.1 workflow to perform chromatin accessibility QTL analysis. We set the cis window size to 200,000 bp and excluded peaks that had fewer than 25 variants within that window. More details of the association testing workflow can be found here.
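
    The peak-exclusion rule can be sketched with a binary search over sorted variant positions. This is an illustration under assumptions, not the qtlmap implementation: it treats the cis window as 200,000 bp on each side of a peak centre, and assumes all positions lie on the same chromosome. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

CIS_WINDOW = 200_000   # bp, from the description
MIN_VARIANTS = 25      # peaks with fewer variants are excluded


def variants_in_cis(peak_center, sorted_positions, window=CIS_WINDOW):
    """Count variants within +/- window of the peak centre (assumed definition)."""
    lo = bisect_left(sorted_positions, peak_center - window)
    hi = bisect_right(sorted_positions, peak_center + window)
    return hi - lo


def is_testable(peak_center, sorted_positions, window=CIS_WINDOW,
                min_variants=MIN_VARIANTS):
    """Return True when the peak has enough cis variants to be tested."""
    return variants_in_cis(peak_center, sorted_positions, window) >= min_variants


# Toy data: 30 variants spaced 1 kb apart, starting at position 1,000,000.
positions = [1_000_000 + 1_000 * k for k in range(30)]
for center in (1_010_000, 1_225_000, 2_000_000):
    print(center, variants_in_cis(center, positions), is_testable(center, positions))
```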

  5. Cancer Cell Line Encyclopedia

    • rrid.site
    Updated Aug 21, 2010
    Cite
    (2010). Cancer Cell Line Encyclopedia [Dataset]. http://identifiers.org/RRID:SCR_013836
    Explore at:
    Dataset updated
    Aug 21, 2010
    Description

    A collaborative project between the Broad Institute and the Novartis Institutes for Biomedical Research and its Genomics Institute of the Novartis Research Foundation, with the goal of conducting a detailed genetic and pharmacologic characterization of a large panel of human cancer models. The CCLE also works to develop integrated computational analyses that link distinct pharmacologic vulnerabilities to genomic patterns and to translate cell line integrative genomics into cancer patient stratification. The CCLE provides public access to genomic data, analysis and visualization for about 1000 cell lines.

  6. OzOat Diversity Panel: A Global Genome-to-Phenome Resource for Oat Breeding

    • data.csiro.au
    Updated Dec 31, 2025
    Cite
    Meredith McNeil; Bill Bovill; Tina Rathjen; Ben Trevaskis; Felicity Harris; Allan Rattey; Scott Boden (2025). OzOat Diversity Panel: A Global Genome-to-Phenome Resource for Oat Breeding [Dataset]. http://doi.org/10.25919/9byg-4635
    Explore at:
    Dataset updated
    Dec 31, 2025
    Dataset provided by
    CSIRO (https://www.csiro.au/)
    Authors
    Meredith McNeil; Bill Bovill; Tina Rathjen; Ben Trevaskis; Felicity Harris; Allan Rattey; Scott Boden
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 1, 2020 - Nov 29, 2024
    Area covered
    Dataset funded by
    InterGrain Pty Ltd
    University of Adelaide
    CSIRO (https://www.csiro.au/)
    NSW Primary Industries
    Charles Sturt University
    Description

    The OzOat diversity panel is a curated collection of approximately 319 oat accessions representing global genetic diversity and the breeding history of Australian oats. The panel spans material from historical landraces (circa 1892) to modern cultivars, including key international introductions and donor lines. Designed to maximise recombination and capture broad phenotypic diversity, the OzOat panel underpins gene discovery and marker development for a wide range of traits. High-density SNP genotyping, transcriptome sequencing, and curated pedigree information provide a powerful genome-to-phenome platform for accelerating genetic discovery and breeding innovation. Combined, these resources enable the development of oat varieties with improved resilience, productivity, and quality tailored to Australian farming systems. Links to genomic, pedigree, and marker–trait association datasets generated as part of the GRDC project "Optimising genetic control of oat phenology for Australia" (CSP2007-002RTX) are provided within this DAP entry. Lineage: The GRDC project "Optimising genetic control of oat phenology for Australia" (CSP2007-002RTX) supported the foundational work to establish the panel, concepts, and GWAS/TWAS pipelines underpinning the 'OzOat' panel. The types of data generated include:

    SNP data:

    - Genomic SNPs from the DArTSeq SNP platform

    - SNP haplotype file combining DArTSeq SNPs and transcriptome SNPs, generated after removal of missing or poor-quality data (genotypes with >50% missing data removed, SNPs with >20% missing data removed). Monomorphic markers and those with a minor allele frequency less than 5% were also removed. File sorted by physical chromosome position of each SNP in the Oat Sang_v0 and Oat OT3098_v1 (PepsiCo) reference genomes.
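
    The quality filters listed for the SNP haplotype file can be sketched as below. This is an illustration under several assumptions that the description does not confirm: "genotypes" is read as oat lines (samples), lines are filtered before SNPs, calls are coded as alternate-allele dosages 0/1/2 with `None` for missing, and allele frequencies are computed from called data only. The function names and toy data are hypothetical.

```python
def missing_fraction(calls):
    """Fraction of missing (None) entries in a list of calls."""
    return sum(c is None for c in calls) / len(calls)


def filter_panel(genotypes, max_line_missing=0.5, max_snp_missing=0.2, min_maf=0.05):
    """Return (filtered genotypes, indices of kept SNPs).

    genotypes maps line name -> list of dosages (0, 1, 2) or None, one per SNP.
    """
    # Step 1: drop lines with more than 50% missing calls.
    lines = {name: calls for name, calls in genotypes.items()
             if missing_fraction(calls) <= max_line_missing}
    if not lines:
        return {}, []
    n_snps = len(next(iter(lines.values())))
    kept = []
    for j in range(n_snps):
        column = [calls[j] for calls in lines.values()]
        # Step 2: drop SNPs with more than 20% missing calls.
        if missing_fraction(column) > max_snp_missing:
            continue
        called = [c for c in column if c is not None]
        if not called:
            continue
        alt_freq = sum(called) / (2 * len(called))
        maf = min(alt_freq, 1 - alt_freq)
        # Step 3: drop monomorphic markers and those with MAF below 5%.
        if maf == 0 or maf < min_maf:
            continue
        kept.append(j)
    filtered = {name: [calls[j] for j in kept] for name, calls in lines.items()}
    return filtered, kept


panel = {
    "line_a": [0, 2, None, 0],
    "line_b": [0, 0, None, 2],
    "line_c": [0, 2, None, None],
    "line_d": [None, None, None, 1],  # 75% missing, so this line is dropped
}
filtered, kept = filter_panel(panel)
print(kept, filtered)
```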

    Phenotype data:

    1. Controlled environment (long day, long day with vernalisation, short day) conditions at Adelaide University (Scott Boden), for the complete OzOat panel.

    2. Field experiments were conducted at Wagga Wagga, New South Wales (147.3°E, 35.1°S, elevation ~210 m) in 2021 and 2022. A subset of the OzOat panel was evaluated. In 2021, 80 oat genotypes were sown at two sowing dates (7 May and 2 June), and in 2022, 60 genotypes were sown at three sowing dates (14 April, 3 May, and 24 May).

    Pedigree: The Helium Pedigree Visualisation Framework (Shaw et al. 2014) was utilised to view over 1000 international accessions spanning diversity relevant to the history of Australian oat breeding. From this list, 319 oat lines were selected to form the OzOat panel. A Helium-compatible csv file contains all known ancestors for these 1000 oat accessions, lines selected for inclusion in the OzOat panel are highlighted in green. Pedigrees were obtained from the “Pedigrees of Oat Lines” POOL database (Tinker and Deyl, 2005), from Fitzsimmons et al. (1983) and directly from oat breeders (Dr Pamela Zwer and Dr Bruce Winter, personal communication).

    Fitzsimmons, R. W., Roberts, G. L., and Wrigley, C. W. (1983). Australian Oat varieties (Melbourne: CSIRO Publishing). doi: 10.1071/9780643105447

    Shaw, P. D., Graham, M., Kennedy, J., Milne, I., and Marshall, D. F. (2014). Helium: visualization of large scale plant pedigrees. BMC Bioinf. 15, 259. doi: 10.1186/1471-2105-15-259

    Tinker, N. A., and Deyl, J. K. (2005). A curated internet database of oat pedigrees. Crop Sci. 45, 2269–2272. doi: 10.2135/cropsci2004.0687

  7. Pan-cancer Aberrant Pathway Activity Analysis (PAPAA)

    • zenodo.org
    application/gzip, csv +1
    Updated Dec 5, 2020
    Cite
    DANIEL BLANKENBERG; VIJAY NAGAMPALLI (2020). Pan-cancer Aberrant Pathway Activity Analysis (PAPAA) [Dataset]. http://doi.org/10.5281/zenodo.3625201
    Explore at:
    Available download formats: tsv, application/gzip, csv
    Dataset updated
    Dec 5, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    DANIEL BLANKENBERG; VIJAY NAGAMPALLI
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information about the dataset files:

    1) pancan_rnaseq_freeze.tsv.gz: Publicly available gene expression data for the TCGA Pan-cancer dataset. File: PanCanAtlas EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611] [https://doi.org/10.1016/j.celrep.2018.03.046]

    2) pancan_mutation_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset. File: mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]

    3) pancan_GISTIC_threshold.tsv.gz: Publicly available Gene- level copy number information of the TCGA Pan-cancer dataset. This file is processed using script process_copynumber.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. The files copy_number_loss_status.tsv.gz and copy_number_gain_status.tsv.gz generated from this data are used as inputs in our Galaxy pipeline. [https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443] [https://doi.org/10.1016/j.celrep.2018.03.046]

    4) mutation_burden_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/][http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]

    5) sample_freeze.tsv or sample_freeze_version4_modify.tsv: The file lists the frozen samples as determined by the TCGA PanCancer Atlas consortium along with raw RNAseq and mutation data. These were previously determined and included for all downstream analysis. All other datasets were processed and subset according to the frozen samples. [https://github.com/greenelab/pancancer/]

    6) vogelstein_cancergenes.tsv: compendium of OG and TSG used for the analysis. [https://github.com/greenelab/pancancer/]

    7) CCLE_DepMap_18Q1_maf_20180207.txt.gz Publicly available Mutational data for CCLE cell lines from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2FCCLE_DepMap_18Q1_maf_20180207.txt]

    8) ccle_rnaseq_genes_rpkm_20180929.gct.gz: Publicly available Expression data for 1019 cell lines (RPKM) from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2Fccle_2019%2FCCLE_RNAseq_genes_rpkm_20180929.gct.gz]

    9) CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct: Publicly available merged Mutational and copy number alterations that include gene amplifications and deletions for the CCLE cell lines. This data is represented in the binary format and provided by the Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://data.broadinstitute.org/ccle_legacy_data/binary_calls_for_copy_number_and_mutation_data/CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct]

    10) GDSC_cell_lines_EXP_CCLE_names.csv.gz Publicly available RMA normalized expression data for Genomics of Drug Sensitivity in Cancer(GDSC) cell-lines. File gdsc_cell_line_RMA_proc_basalExp.csv was downloaded. This data was subsetted to 389 cell lines that are common among CCLE and GDSC. All the GDSC cell line names were replaced with CCLE cell line names for further processing. [https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip]

    11) GDSC_CCLE_common_mut_cnv_binary.csv.gz: A subset of merged Mutational and copy number alterations that include gene amplifications and deletions for common cell lines between GDSC and CCLE. This file is generated using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct and a list of common cell lines.
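
    The renaming and subsetting described in items 10 and 11 amounts to mapping GDSC cell line names to CCLE names and keeping the intersection. A minimal sketch follows; the function name, the name mapping, and the toy values are all hypothetical and are not taken from the dataset.

```python
def subset_to_common(gdsc_rows, ccle_names, gdsc_to_ccle):
    """Rename GDSC rows to CCLE names and keep lines present in both resources."""
    # Rename: only GDSC lines that have a known CCLE equivalent survive.
    renamed = {gdsc_to_ccle[name]: row
               for name, row in gdsc_rows.items() if name in gdsc_to_ccle}
    # Intersect with the CCLE names; sort for a stable output order.
    common = sorted(set(renamed) & set(ccle_names))
    return {name: renamed[name] for name in common}


# Toy example with invented names and values.
gdsc_rows = {"line-1": [5.1, 7.2], "line-2": [6.0, 3.3], "line-3": [4.4, 4.9]}
gdsc_to_ccle = {"line-1": "LINE1_LUNG", "line-2": "LINE2_SKIN"}
ccle_names = ["LINE1_LUNG", "LINE9_BONE"]

common_rows = subset_to_common(gdsc_rows, ccle_names, gdsc_to_ccle)
print(common_rows)
```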

    12) gdsc1_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC1 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_15Oct19.xlsx]

    13) gdsc2_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC2 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC2_fitted_dose_response_15Oct19.xlsx]

    14) compounds.csv: list of pharmacological compounds tested for our analysis

    15) tcga_dictonary.tsv: list of cancer types used in the analysis.

    16) seg_based_scores.tsv: Measurement of total copy number burden, Percent of genome altered by copy number alterations. This file was used as part of the Pancancer analysis by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/]

