7 datasets found
  1. Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 11, 2024
    Cite
    Vasicek, Jakub (2024). Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project data set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10149277
    Dataset updated
    Dec 11, 2024
    Dataset authored and provided by
    Vasicek, Jakub
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of protein sequences obtained using ProHap (https://github.com/ProGenNo/ProHap) on the data set of phased genotypes published by the 1000 Genomes Project, aligned with the GRCh38 genome build (https://www.internationalgenome.org/data-portal/data-collection/grch38). We used Ensembl v.110 for the mapping of coordinates between genes, exons, and transcripts. The complete configuration file for each ProHap run is attached to this repository.

    This data set contains six compressed directories, five representing the superpopulations included in the 1000 Genomes Project (https://catalog.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project), and one created using all the samples included in the 1000 Genomes data set:

    AFR - African

    AMR - American

    EUR - European

    SAS - South Asian

    EAS - East Asian

    ALL - all participants in the 1000 Genomes Project

    Each of the directories contains the following files:

    F1: The concatenated FASTA file, ready to be used with search engines, which contains the following:

    Protein haplotype sequences obtained by ProHap, using alleles with at least 1 % frequency within the selected population

    Reference proteome as per Ensembl v. 110

    Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)

    The file is provided in two formats - full and simplified. The simplified fasta contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the simplified fasta file.

    F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes

    F3: Translations of haplotype cDNA sequences, before merging with the reference proteome

    For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.

    For the usage of these databases with search engines, and downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.

    When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0

  2. Data from: Single-nucleus RNA-seq and ATAC-seq in outbred rats with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 15, 2023
    Cite
    Mohammadi, Pejman (2023). Data from: Single-nucleus RNA-seq and ATAC-seq in outbred rats with divergent cocaine addiction behaviors reveal long-term changes in gene regulation and GABAergic inhibition in the amygdala [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8242457
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Pokhrel, Narayan
    McVicker, Graham
    Ho, Aaron J.
    Telese, Francesca
    Chitre, Apurva S.
    Zhou, Jessica L.
    Kallupi, Marsida
    Carrette, Lieselot LG
    Mohammadi, Pejman
    de Guglielmo, Giordano
    Palmer, Abraham A.
    Munro, Daniel
    George, Olivier
    Li, Hai-Ri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies our publication titled: "Single-nucleus RNA-seq and ATAC-seq in outbred rats with divergent cocaine addiction behaviors reveal long-term changes in gene regulation and GABAergic inhibition in the amygdala."

    Files Included:

    1. geno.N26.vcf.gz

      • Description: Contains genotypes for 26 Heterogeneous Stock rats whose gene expression was predicted.
    2. pred_expr.Brain.N26.tsv

      • Description: This tab-delimited table contains predicted relative gene expression in the brain for 26 Heterogeneous Stock rats.
      • Details: Predictions were made for 8,997 genes from linear models based on cis-eQTLs from whole brain hemisphere tissue downloaded from the RatGTEx Portal. A gene is included in the table if it had at least one significant cis-eQTL, and if its predicted expression in these 26 animals had nonzero variance. The values in the table give the predicted log2(relative expression), where log2(2) = 1 is the baseline expression from the two haplotypes of the gene if it had only reference alleles at all its regulatory loci.
      • Additional Info: Predictions were generated using gene_expr_pred.py, available at https://github.com/PejLab/gene_expr_pred. An explanation of the prediction model is given in https://doi.org/10.1101/2022.01.28.478116.
    3. Behavioral data.xlsx

      • Description: Contains behavioral data for the Heterogeneous Stock (HS) rats.
      • Organization: Each sheet in the file corresponds to data for a specific figure.
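
The log2 scale described for pred_expr.Brain.N26.tsv can be illustrated with a small sketch. It assumes, based only on the wording above, that a table value of 1 (that is, log2(2)) corresponds to two reference haplotypes each contributing one unit of expression. The function names are hypothetical and are not part of gene_expr_pred.py.

```python
# Baseline stated in the dataset description: log2(2) = 1 for two
# haplotypes carrying only reference alleles (assumed interpretation).
BASELINE_LOG2 = 1.0


def relative_expression(pred_log2: float) -> float:
    """Convert a predicted log2 value back to the linear scale."""
    return 2.0 ** pred_log2


def fold_change_vs_reference(pred_log2: float) -> float:
    """Fold change relative to the all-reference baseline."""
    return 2.0 ** (pred_log2 - BASELINE_LOG2)


if __name__ == "__main__":
    for value in (0.0, 1.0, 2.0):
        print(value, relative_expression(value), fold_change_vs_reference(value))
```

Under this reading, a table value of 2 would mean twice the baseline expression, and a value of 0 would mean half of it.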

    Additional Dataset Locations:

    The primary datasets generated during this study can be found on the Gene Expression Omnibus under accession number GSE212417

    Publicly Available Datasets Utilized:

    • Rattus norvegicus Ensembl v98 reference genome and genome assembly: Rnor_6.0
    • JASPAR2022 transcription factor binding profiles for vertebrates: JASPAR
    • ENCODE Honeybadger 2 ChIP-seq: Broad Institute
    • Liu et al. 2019 GWAS for tobacco and nicotine addiction summary statistics: PubMed
    • RatGTEx Portal tissue-specific cis-eQTLs: RatGTEx Portal
    • 1000 Genomes European reference panel: Alkes Group
    • KEGG pathways: KEGG API
  3. BIT: Bayesian Identification of Transcriptional Regulators from...

    • zenodo.org
    application/gzip, bin
    Updated Apr 1, 2025
    Cite
    Zeyu Lu (2025). BIT: Bayesian Identification of Transcriptional Regulators from Epigenomics-Based Query Region Sets [Dataset]. http://doi.org/10.5281/zenodo.14231098
    Available download formats: application/gzip, bin
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zeyu Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data used for Project: "BIT: Bayesian Identification of Transcriptional Regulators from Epigenomics-Based Query Region Sets"

    BIT package is available on GitHub: GitHub

    We also provide an online web portal: BIT Portal

    Please consult the manual for instructions on loading the reference data.

    Please note that the preprocessed reference database must be loaded before running any function in BIT.

    File Description:

    hg38_200.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome hg38 with bin width 200.

    hg38_500.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome hg38 with bin width 500.

    hg38_1000.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome hg38 with bin width 1000.

    mm10_200.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome mm10 with bin width 200.

    mm10_500.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome mm10 with bin width 500.

    mm10_1000.tar.gz: Pre-processed TR ChIP-seq reference datasets for genome mm10 with bin width 1000.

    Input_Data.tar.gz: contains the input data for the four application cases, including differentially accessible regions (DARs) from bulk and single-cell perturbation experiments, cancer-type-specific accessible regions, and cell-type-specific accessible regions.

    Figure_Data_v2.tar.gz: The updated figure data folder, which includes the data used to generate the manuscript’s plots, as well as the output from the benchmarking methods.

    Figure.R: R code to replicate the figures, used together with Figure_Data_v2.tar.gz.

    Depmap data can be accessed on DepMap Consortium: DepMap

  4. MicrobesOnline

    • rrid.site
    Updated May 4, 2025
    Cite
    (2025). MicrobesOnline [Dataset]. http://identifiers.org/RRID:SCR_005507
    Dataset updated
    May 4, 2025
    Description

    MicrobesOnline is designed specifically to facilitate comparative studies on prokaryotic genomes. It is an entry point for operon, regulon, cis-regulatory and network predictions based on comparative analysis of genomes. The portal includes over 1000 complete genomes of bacteria, archaea and fungi and thousands of expression microarrays from diverse organisms ranging from model organisms such as Escherichia coli and Saccharomyces cerevisiae to environmental microbes such as Desulfovibrio vulgaris and Shewanella oneidensis. To assist in annotating genes and in reconstructing their evolutionary history, MicrobesOnline includes a comparative genome browser based on phylogenetic trees for every gene family as well as a species tree. To identify co-regulated genes, MicrobesOnline can search for genes based on their expression profile, and provides tools for identifying regulatory motifs and checking whether they are conserved. MicrobesOnline also includes fast phylogenetic profile searches, comparative views of metabolic pathways, operon predictions, a workbench for sequence analysis and integration with RegTransBase and other microbial genome resources. The next update of MicrobesOnline will contain significant new functionality, including comparative analysis of metagenomic sequence data. Programmatic access to the database, along with source code and documentation, is available at http://microbesonline.org/programmers.html.

  5. Data from: Test data for CIBERER nextflow pipelines

    • portalcientifico.unav.edu
    • zenodo.org
    Updated 2025
    Cite
    Ruiz-Arenas, Carlos; Sevilla-Porras, Marta; López López, Daniel (2025). Test data for CIBERER nextflow pipelines [Dataset]. https://portalcientifico.unav.edu/documentos/67321e7daea56d4af048579b
    Dataset updated
    2025
    Authors
    Ruiz-Arenas, Carlos; Sevilla-Porras, Marta; López López, Daniel
    Description

    The test files used for the mosaicism-nextflow pipeline, owned by CIBERER pipelines, are processed files. The raw data originates from sample NA12878 from the GIAB project, accessible at https://www.internationalgenome.org/data-portal/search?q=NA12878.

    The specific region selected for analysis is: 17:7577873-7580187.
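
As a rough illustration, the region string above can be split into its parts and its size computed. The sketch assumes the common 1-based, inclusive `chrom:start-end` convention, which the dataset description does not state explicitly. The `Region` type and `parse_region` helper are hypothetical names introduced here.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates (an assumption, see above).
        return self.end - self.start + 1


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string into a Region."""
    chrom, sep, span = text.rpartition(":")
    start_s, dash, end_s = span.partition("-")
    if not sep or not dash or not chrom:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    start, end = int(start_s), int(end_s)
    if start > end:
        raise ValueError(f"start exceeds end in region: {text!r}")
    return Region(chrom, start, end)


if __name__ == "__main__":
    region = parse_region("17:7577873-7580187")
    print(region, region.length)
```

Under the inclusive convention, the region spans 2,315 bases on chromosome 17.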

    The test dataset for CNV (test_set.tar.gz), generated in silico by VISOR (https://doi.org/10.1093/bioinformatics/btz719), consists of 8 combinations of 4 different haplotypes derived from chromosome 22 (GRCh38), at an average coverage of 30x.

    A small test set for CNVs (CNV_test_set.tar.gz) consists of 10 samples from the 1000 Genomes Project (hs37d5), each with a deletion in one exon of BRCA1.

  6. Gamarada debralockiae gen. nov. sp. nov.

    • data.csiro.au
    • researchdata.edu.au
    • +1 more
    Updated Nov 30, 2017
    Cite
    David Midgley; Brodie Sutcliffe; Paul Greenfield; Nai Tran-Dinh (2017). Gamarada debralockiae gen. nov. sp. nov. [Dataset]. http://doi.org/10.4225/08/5a1f4fa91a5e9
    Dataset updated
    Nov 30, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    David Midgley; Brodie Sutcliffe; Paul Greenfield; Nai Tran-Dinh
    License

    https://research.csiro.au/dap/licences/csiro-data-licence/

    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    Whole genome sequencing was performed on Gamarada debralockiae gen. nov. sp. nov. Following sequence assemblies, contigs greater than 1,000 nucleotides were subjected to gene annotation and analysis. We provide here the remaining contigs which were less than 1,000 nucleotides in length.

  7. Virgo Benchmarking Datasets

    • figshare.com
    zip
    Updated Apr 13, 2025
    Cite
    Christopher Riccardi (2025). Virgo Benchmarking Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.28730093.v1
    Available download formats: zip
    Dataset updated
    Apr 13, 2025
    Dataset provided by
    figshare
    Authors
    Christopher Riccardi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    1. Global Ocean Eukaryotic Viral (GOEV) Database

      • Source: Extract from Gaïa M. et al. Nature (2023)
      • Components:
        • 591 MAGs from Schulz, F. et al. (2020) [DOI: 10.1038/s41586-020-1957-x]
        • 445 MAGs from Sunagawa, S. et al. (2020) [DOI: 10.1038/s41579-020-0364-5]
        • 218 MAGs from Moniruzzaman, M. et al. (2020) [DOI: 10.1038/s41467-020-15507-2]
        • 158 reference viral assemblies
      • Accessed: July 20, 2024
      • Data File: GOEV_DB_CONTIGS.db.zip from Figshare
      • Selection Criteria: Only contigs labeled at the Order taxonomic level were retained
      • Sampling Method: Not applicable
      • Final Sample Size: 1,412 viral contigs
    2. Known Viral Sequence Clusters (kVSCs)

      • Source: Extract from Zolfo, M. et al. (2024) [DOI: 10.1101/2024.02.19.580813]
      • Data Files:
        • VSC5_rep_fnas_nr99_45k_metaphlanDB.fna.gz
        • VSCs_groups.csv metadata
      • Downloaded From: Zenodo, last accessed June 28, 2024
      • Selection Criteria:
        • Started from 45,872 representative sequences from MetaPhlan 4.1
        • Selected kVSCs (sequences clustering with a RefSeq representative)
        • Verified RefSeq accessions against ICTV Release #39 for accurate labeling
      • Sampling Method: RefSeq matching based on metadata
      • Final Sample Size: 2,232 representative sequences
    3. ICTV Release #39

      • Source: International Committee on Taxonomy of Viruses (ICTV) Release #39
      • Downloaded Using: ICTVdump tool on July 17, 2024
      • Selection Criteria:
        • Viruses present in both VMR releases #37 and #39
        • At least two representatives per family
      • Sampling Method: Up to 5 genomes randomly sampled per family using pandas.sample(), 192 families represented
      • Final Sample Size: 860
    4. RefSeq Viral Dataset (Random Iteration)

      • Source: NCBI Virus Portal (NCBI Virus), accessed on January 27, 2025
      • Selection Criteria: Viruses with an assigned family-level taxonomy, up to 43 viruses per family
      • Sampling Method: Random uniform sampling
      • Final Sample Size: 6,778 viral genomes
    5. RefSeq Viral Dataset (Prokaryote-Infecting)

      • Source: NCBI Virus Portal (NCBI Virus), accessed on January 27, 2025
      • Selection Criteria:
        • Viruses with an assigned family-level taxonomy
        • Prokaryote-infecting viruses only
      • Sampling Method: Random uniform sampling
      • Final Sample Size: 3,536 viral genomes
    6. ICTV Release #39 (Reduction Study Subset)

      • Source: International Committee on Taxonomy of Viruses (ICTV) Release #39
      • Downloaded Using: ICTVdump tool on July 17, 2024
      • Selection Criteria: 1,000 viruses randomly sampled from the full release
      • Sampling Method: Random uniform sampling
      • Final Sample Size: 1,000 viral genomes
      • Notes: Reduction study starting data. We provide the source code for generating the fragmented genomes.
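
The per-family sampling reported for the ICTV Release #39 subset (up to 5 genomes per family) might look roughly like the sketch below. The description mentions pandas.sample(); this sketch swaps in the standard library's `random.sample` so that it stays self-contained, and it is not the author's code. The record layout, the function name `sample_per_family`, and the seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_family(records, max_per_family=5, seed=0):
    """Pick up to `max_per_family` genome IDs at random from each family.

    `records` is an iterable of (genome_id, family) pairs (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_family = defaultdict(list)
    for genome_id, family in records:
        by_family[family].append(genome_id)

    sampled = {}
    for family, genomes in sorted(by_family.items()):
        # Families smaller than the cap are kept whole.
        k = min(max_per_family, len(genomes))
        sampled[family] = rng.sample(genomes, k)
    return sampled


if __name__ == "__main__":
    # Made-up identifiers, not real accessions.
    demo = [(f"genome_A{i}", "FamilyA") for i in range(8)]
    demo += [(f"genome_B{i}", "FamilyB") for i in range(2)]
    for family, genomes in sample_per_family(demo).items():
        print(family, genomes)
```

Sampling without replacement within each family keeps small families intact while capping large ones, which matches the "up to N per family" wording used for datasets 3 and 4.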