23 datasets found
  1. RNA-seq example data

    • kaggle.com
    zip
    Updated Jun 16, 2023
    Cite
    Tuhin Rana (2023). RNA-seq example data [Dataset]. https://www.kaggle.com/datasets/rana2hin/rna-seq-example-data
    Available download formats: zip (2193914798 bytes)
    Dataset updated
    Jun 16, 2023
    Authors
    Tuhin Rana
    License

    https://cdla.io/sharing-1-0/

    Description

    Dataset Description

    This dataset contains RNA-seq data from human cells. The data was collected using the Illumina HiSeq 2500 platform. The data includes raw sequencing reads, gene annotations, and phenotypic data for the samples.

    Files and Folders

    Files can be downloaded using the following command:

    wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
    

    Once the file has been downloaded, it can be extracted using the following command:

    tar xvzf chrX_data.tar.gz
    

    This will create a directory called chrX_data containing the following files:

    genes/chrX.gtf
    genome/chrX.fa
    geuvadis_phenodata.csv
    indexes/
    mergelist.txt
    samples/
    

    Here are some additional details about the files in the chrX_data directory:

    • genes/chrX.gtf - This file contains gene annotations for the human X chromosome. It is in the GTF format, which is a standard format for gene annotations. The GTF file contains information about the start and end positions of genes, as well as their transcripts.
    • genome/chrX.fa - This file contains the reference genome sequence for the human X chromosome. It is in the FASTA format, which is a standard format for storing DNA sequences.
    • geuvadis_phenodata.csv - This file contains phenotypic data for the samples in the dataset. The phenotypic data includes the sample ID, sex, and population of each sample.
    • indexes/ - This directory contains index files for HISAT2. Index files are used to speed up the alignment of sequencing reads to a reference genome.
    • mergelist.txt - This file lists the per-sample transcript assembly (GTF) files to be merged. It is intended as input to the StringTie merge step, which combines the transcripts assembled for each sample into one non-redundant set.
    • samples/ - This directory contains the raw sequencing data. The raw sequencing data is in the FASTQ format, which is a standard format for storing sequencing reads.
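
    To make the GTF layout mentioned above more concrete, the following is a minimal sketch of parsing one annotation line into its nine tab-separated fields. The sample record and the helper name parse_gtf_line are made up for illustration and are not taken from chrX.gtf.

```python
def parse_gtf_line(line):
    """Split one GTF record into a dict; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    # The ninth column holds 'key "value";' pairs.
    attributes = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }

# Hypothetical record, for illustration only.
example = 'chrX\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";'
record = parse_gtf_line(example)
exon_length = record["end"] - record["start"] + 1
print(record["feature"], exon_length, record["attributes"]["gene_id"])
```

    For the real file, the same function could be applied line by line to an opened genes/chrX.gtf, skipping the None results.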

    Usage

    This dataset can be used to perform RNA-seq analysis using a variety of tools, such as HISAT2, StringTie, and Ballgown.

    Here are some examples of how this dataset can be used:

    • To identify differentially expressed genes between two groups of samples.
    • To build a gene expression atlas for a particular tissue or cell type.
    • To study the expression of genes involved in a particular disease.

    source: ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

  2. LifeDB

    • scicrunch.org
    • dknet.org
    • +2more
    Updated Jan 31, 2026
    Cite
    (2026). LifeDB [Dataset]. http://identifiers.org/RRID:SCR_006899
    Dataset updated
    Jan 31, 2026
    Description

    Database that integrates large-scale functional genomics assays and manual cDNA annotation with bioinformatics gene expression and protein analysis. LifeDB integrates data on full-length cDNA clones with data on the expression of the encoded proteins and their subcellular localization in mammalian cell lines. It lets the scientific community systematically search for and select genes, proteins, and cDNAs of interest by specific database identifiers or by gene name, and it visualizes cDNA clones and the subcellular location of proteins. It also links the results to external biological databases to provide broader functional information. In addition, LifeDB provides an annotation pipeline that improves the mapping of clones to known human reference transcripts from the RefSeq and Ensembl databases.

    An advanced web interface lets researchers view the data in a more user-friendly manner. Both "Search genes and cDNA clones" and "Search sub-cellular locations of human proteins" accept any one of the following search options: by keyword, by gene/transcript identifier, by plate name, by clone name, or by cellular location.

    • The "Search genes and cDNA clones" results include: Gene Name, Ensembl ID, Genomic Region, Clone name, Plate name, Plate position, Classification class, Synonymous SNPs, Non-synonymous SNPs, Number of ambiguous positions, and Alignment with reference genes.
    • The "Search sub-cellular locations of human proteins" results include: Subcellular location, Gene Name, Ensembl ID, Clone name, True localization, Images, Start tag, and End tag.

    Every result page has an option to download the result data (excluding the microscopy images). Clicking the 'Download results as CSV-file' link on the result page lets the user open or save the result data as a CSV (comma-separated values) file, which can then be opened in Excel or OpenOffice.

  3. Genomes To Fields 2016

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    bin
    Updated Dec 18, 2023
    Cite
    Darwin Campbel; Natalia deLeon; Jode Edwards; Jack Gardiner; Naser Al Khalifah; Carolyn J. Lawrence-Dill; Jane Petzoldt; Cinta Romay; Renee Walton; Genomes to Fields Cooperators (2023). Genomes To Fields 2016 [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Genomes_To_Fields_2016/24852822
    Available download formats: bin
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    CyVerse Data Commons
    Authors
    Darwin Campbel; Natalia deLeon; Jode Edwards; Jack Gardiner; Naser Al Khalifah; Carolyn J. Lawrence-Dill; Jane Petzoldt; Cinta Romay; Renee Walton; Genomes to Fields Cooperators
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Phenotypic, genotypic, and environment data for the 2016 field season: The data is stored in CyVerse. Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental (soil, weather) data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.

    Resources in this dataset:

    • Resource Title: CyVerse Genomes To Fields 2016 dataset download. File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/GenomesToFields_G2F_2016_Data_Mar_2018 Dataset (csv) and metadata (BibTex, Endnote) data downloads. See _readme.txt for file contents.

  4. Drosophila Melanogaster Genome

    • kaggle.com
    • ieee-dataport.org
    zip
    Updated Nov 17, 2019
    Cite
    Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
    Available download formats: zip (136202106 bytes)
    Dataset updated
    Nov 17, 2019
    Authors
    Myles O'Neill
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Drosophila Melanogaster

    Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

    When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).

    https://en.wikipedia.org/wiki/Drosophila_melanogaster

    About the Genome

    This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

    ![D. melanogaster chromosomes][1]

    The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

    Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

    Bioinformatics

    Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations to each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics]4, [Chromosomes][7], [DNA][8], [RNA]9, [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

    Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

    Learning Bioinformatics

    There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

    Files in this Dataset

    Drosophila Melanogaster Genome

    • genome.fa

    The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
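
    Because repeats are encoded by letter case, the share of soft-masked sequence per chromosome can be estimated by counting lower-case bases. The sketch below illustrates this on a toy FASTA string; the sequence names and bases are invented, and the helper names are not part of the dataset.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header line as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def soft_masked_fraction(sequence):
    """Fraction of bases written in lower case (soft-masked repeats)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)

# Toy input for illustration; for the real file, pass open("genome.fa") instead.
toy = io.StringIO(">chrA\nACGTacgt\nACGT\n>chrB\nacgtacgt\n")
fractions = {name: soft_masked_fraction(seq) for name, seq in read_fasta(toy)}
print(fractions)
```

    Reading line by line keeps memory use proportional to one chromosome at a time, which matters for a whole-genome file.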

    Meta Information

    There are 3 additional files with meta information about the genome.

    • meta-cpg-island-ext-unmasked.csv

    This file contains descriptive information about CpG Islands in the genome.

    https://en.wikipedia.org/wiki/CpG_site

    • meta-cytoband.csv

    This file describes the positions of cytogenetic bands on each chromosome.

    https://en.wikipedia.org/wiki/Cytogenetics

    • meta-simple-repeat.csv

    This file describes simple tandem repeats in the genome.

    https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat

    Drosophila Melanogaster mRNA Sequences

    Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

    https://en.wikipedia.org/wiki/Messenger_RNA

    • mrna-genbank.fa

    This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/genbank/

    • mrna-refseq.fa

    This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/refseq/

    Gene Predictions

    A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...

  5. Data from: Cell-specific gene-expression profiles and cortical thickness in...

    • figshare.com
    txt
    Updated Oct 29, 2019
    Cite
    Jean Shin; Leon French (2019). Cell-specific gene-expression profiles and cortical thickness in the human brain [Dataset]. http://doi.org/10.6084/m9.figshare.4752955.v3
    Available download formats: txt
    Dataset updated
    Oct 29, 2019
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Jean Shin; Leon French
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Neurobiological underpinnings of variations in cortical structure, such as cortical thickness, in the human brain are largely unknown. In this data report, we describe a method to evaluate the contribution of nine neural cell-types in explaining inter-regional variations in cortical structure measured with FreeSurfer in participants from a study cohort, by correlating the cortical-structure data and gene-expression data from the Allen Human Brain Atlas, summarized into the FreeSurfer cortical regions and validated using the BrainSpan Atlas. The resulting graphical display and re-sampling based association test results can give new insights into the underlying neurobiological mechanisms of cortical structure variation.

    Files available (a detailed description of each file is included in 'README.doc'):

    • AllenHBA_DK_ExpressionMatrix.tsv - Tab-separated file containing correlation to the median values across the donors (left hemisphere only, column named "Average donor correlation to median") and gene expression values across the 68 FreeSurfer cortical regions (columns).
    • DKRegionStatistics.tsv - Tab-separated file characterizing the FreeSurfer cortical regions. This file lists how many donors contribute to each region, Allen Brain Atlas samples per region and alternative identifiers.
    • BrainSpanToHBARegionMapping.csv - Comma-separated file characterizing the 11 cortical regions that are commonly available for the Allen Brain and Brainspan Atlases.
    • Reference_Consistent_Genes_ObtainedBy2StageFiltering.tsv - Tab-separated file containing correlation coefficients ("correlation"), the corresponding 1-sided p-value ("pvalue") for gene-expression levels between Allen and Brainspan Atlases for the 11 cortical regions that are available for both atlases, and cell-type ("CellType") assignments based on Zeisel et al. (2015) for the 2,511 consistent genes obtained by applying a 2-stage filtering procedure.
    • Profile_males_corcoef_thickness_age.csv - Comma-separated file containing an example phenotype-profile (users need to format their own cohort-specific file like this example).
    • GetBrainSpanCorrelations.R - R script to calculate and test correlation for gene expression profiles from Allen Human Brain vs. Brainspan Atlases.
    • LoadBrainSpanExpressionFunctions.R - R script defining a function to load expression data 'Exon microarray summarized to genes' available from the Brainspan Atlas download website (http://brainspan.org/static/download.html).
    • GetGOGroupsFilteredHumanLists.R - R script to run GO analysis, where each Zeisel cell-type panel is used as a foreground gene set and the background set is the reference set with the 2,511 genes that pass the 2-stage filtering procedure.
    • GetExpressionPhenotypeCorrelations.R - R script to calculate expression-phenotype correlations and produce a 3-row-by-3-column graphical display of cell-type-specific empirical distributions of the resulting correlation coefficients with corresponding (1 - significance level) × 100% critical values, which are unadjusted for multiple testing.
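
    The core idea of an expression-phenotype correlation can be sketched as follows: for one gene, correlate its expression across cortical regions with a phenotype profile across the same regions, then judge the result against shuffled profiles. This is only an illustration with made-up numbers; the simple permutation test is a stand-in chosen here, and the authors' actual procedure lives in their R scripts.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffled profiles with |r| >= observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up values for six regions (not taken from the dataset).
expression = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7]
thickness = [2.4, 2.9, 2.2, 3.1, 2.7, 3.0]
r = pearson(expression, thickness)
p = permutation_p_value(expression, thickness)
print(round(r, 3), round(p, 4))
```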

  6. Data from: Distinct evolutionary trajectories in the Escherichia coli...

    • microbiology.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Elizabeth Cummins; Rebecca Hall; Christopher H. Connor; James McInerney; Alan McNally (2023). Distinct evolutionary trajectories in the Escherichia coli pangenome occur within sequence types [Dataset]. http://doi.org/10.6084/m9.figshare.21360108.v1
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Microbiology Society
    Authors
    Elizabeth Cummins; Rebecca Hall; Christopher H. Connor; James McInerney; Alan McNally
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material for 'Distinct evolutionary trajectories in the Escherichia coli pangenome occur within sequence types' as published in Microbial Genomics.

    pangenome_data.zip contains (for each ST):

    • ST_updated_download.txt - the download list from Enterobase including accession numbers, Bio Project IDs and sample IDs.
    • ST_gene_presence_absence_roary.csv - gene presence/absence matrix output by Panaroo.
    • ST_pan_genome_reference.fa - linear reference genome output by Panaroo.
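
    A gene presence/absence matrix is typically summarised by how many genomes carry each gene. The sketch below shows one possible way to do that on a toy table. The simplified column layout (a Gene column followed by one column per genome, empty when the gene is absent) and the 99% core threshold are assumptions made for this example; the real files carry additional annotation columns that would need to be skipped.

```python
import csv
import io

def classify_genes(csv_text, genome_columns, core_threshold=0.99):
    """Label each gene 'core' or 'accessory' by the share of genomes carrying it."""
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A non-empty cell means the gene was found in that genome.
        present = sum(1 for col in genome_columns if row[col].strip())
        share = present / len(genome_columns)
        labels[row["Gene"]] = "core" if share >= core_threshold else "accessory"
    return labels

# Toy matrix for illustration only.
toy_matrix = """Gene,genome_1,genome_2,genome_3
geneA,a1,a2,a3
geneB,b1,,b3
geneC,,,c3
"""
labels = classify_genes(toy_matrix, ["genome_1", "genome_2", "genome_3"])
print(labels)
```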

  7. Data from: MarFERReT: an open-source, version-controlled reference library...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 22, 2025
    Cite
    Groussman, Mora J; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    University of Washington
    Authors
    Groussman, Mora J; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT code repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

    The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository.

    This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh. The following MarFERReT data products are available in this repository:

    MarFERReT.v1.1.1.metadata.csv: This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier.

    accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

    marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

    tax_id: The NCBI Taxonomy ID (taxID).

    pr2_accession: Best-matching PR2 accession ID associated with entry

    pr2_rank: The lowest shared rank between the entry and the pr2_accession

    pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

    data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

    data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

    source_link: URL where the original sequence data and/or metadata was collected.

    pub_year: Year of data release or publication of linked reference.

    ref_link: PubMed URL of the published reference for the entry, if available.

    ref_doi: DOI of entry data from source, if available.

    source_filename: Name of the original sequence file name from the data source.

    seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

    n_seqs_raw: Number of sequences in the original sequence file.

    source_name: Full organism name from entry source

    original_taxID: Original NCBI taxID from entry data source metadata, if available

    alias: Additional identifiers for the entry, if available

    MarFERReT.v1.1.1.curation.csv: This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier

    marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

    tax_id: Verified NCBI taxID used in MarFERReT

    taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

    taxID_notes: Notes on the original_taxID

    n_seqs_raw: Number of sequences in the original sequence file

    n_pfams: Number of Pfam domains identified in protein sequences

    qc_flag: Early validation quality control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

    flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

    VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

    flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.

    rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

    rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

    flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

    flag_sum: Number of flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63) that are set for the entry. All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

    accepted: Acceptance into the final MarFERReT build (Y or N).
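
    The acceptance rule described by these fields can be sketched in a few lines. This is an illustrative reading of the field descriptions, not the MarFERReT build code: the function name is hypothetical, it assumes flag_VanVlierberghe is derived from VV_contam_pct, and it assumes qc_flag counts once even when both LOW_SEQS and LOW_PFAMS apply.

```python
def curate_entry(n_seqs_raw, n_pfams, lasek_flagged, vv_contam_pct, rp63_contam_pct):
    """Return (flag_sum, accepted) for one candidate entry.

    Contamination percentages may be None when no estimate is available.
    """
    flag_columns = {
        # LOW_SEQS: fewer than 1,200 raw sequences; LOW_PFAMS: fewer than 500 Pfam annotations.
        "qc_flag": n_seqs_raw < 1200 or n_pfams < 500,
        "flag_Lasek": bool(lasek_flagged),
        "flag_VanVlierberghe": vv_contam_pct is not None and vv_contam_pct > 50,
        "flag_rp63": rp63_contam_pct is not None and rp63_contam_pct > 50,
    }
    # Count how many flag columns are set; any flag rejects the entry.
    flag_sum = sum(flag_columns.values())
    accepted = "Y" if flag_sum == 0 else "N"
    return flag_sum, accepted

# Made-up entries for illustration.
clean = curate_entry(25000, 4000, False, 3.2, 1.5)
flagged = curate_entry(900, 300, False, 62.0, None)
print(clean, flagged)
```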

    MarFERReT.v1.1.1.proteins.faa.gz: This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

    MarFERReT.v1.1.1.taxonomies.tab.gz: This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for back-compatibility as input for the DIAMOND alignment software and LCA analysis.

    The columns in this file contain the following information:

    accession: (NA)

    accession.version: The unique MarFERReT sequence identifier ('mftX').

    taxid: The NCBI Taxonomy ID associated with this reference sequence.

    gi: (NA).
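
    As a rough illustration of this four-column layout, the sketch below builds a lookup from sequence identifier to NCBI taxID. It assumes the file starts with a header row naming the four columns and that identifiers are the literal prefix 'mft' followed by ten digits; the rows and taxID values are made up for the example.

```python
import csv
import io
import re

# 'mft' followed by a ten-digit integer (assumed literal prefix).
MFT_ID = re.compile(r"^mft\d{10}$")

def load_taxid_map(handle):
    """Map accession.version -> taxid from a four-column tab-separated table."""
    mapping = {}
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        seq_id = row["accession.version"]
        if not MFT_ID.match(seq_id):
            raise ValueError(f"unexpected sequence identifier: {seq_id!r}")
        # 'accession' and 'gi' hold NA placeholders and are ignored.
        mapping[seq_id] = int(row["taxid"])
    return mapping

# Toy rows for illustration; for the real file, use gzip.open(path, "rt").
toy = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NA\tmft0000000001\t2864\tNA\n"
    "NA\tmft0000000002\t35128\tNA\n"
)
taxid_map = load_taxid_map(toy)
print(taxid_map)
```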

    MarFERReT.v1.1.1.proteins_info.tab.gz: This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

    aa_id: the unique identifier for each MarFERReT protein sequence.

    entry_id: The unique numeric identifier for each MarFERReT entry.

    source_defline: The original, unformatted sequence identifier

    MarFERReT.v1.1.1.best_pfam_annotations.csv.gz: This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries; derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

    aa_id: The unique MarFERReT protein sequence ID ('mftX').

    pfam_name: The shorthand Pfam protein family name.

    pfam_id: The Pfam identifier.

    pfam_eval: hmm profile match e-value score

    pfam_score: hmm profile match bitscore
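
    Selecting a single best annotation per protein from several domain hits might look like the sketch below. Ranking by bit score (pfam_score) is an assumption made here, since the description only says "best-scoring"; the hit records and domain names are invented for illustration.

```python
def best_annotation_per_protein(hits):
    """Keep, for each aa_id, the hit with the highest bit score."""
    best = {}
    for hit in hits:
        current = best.get(hit["aa_id"])
        if current is None or hit["pfam_score"] > current["pfam_score"]:
            best[hit["aa_id"]] = hit
    return best

# Made-up hits for illustration only.
hits = [
    {"aa_id": "mft0000000001", "pfam_name": "DomainA", "pfam_id": "PF00001",
     "pfam_eval": 1e-20, "pfam_score": 75.2},
    {"aa_id": "mft0000000001", "pfam_name": "DomainB", "pfam_id": "PF00002",
     "pfam_eval": 1e-35, "pfam_score": 120.4},
    {"aa_id": "mft0000000002", "pfam_name": "DomainC", "pfam_id": "PF00003",
     "pfam_eval": 1e-8, "pfam_score": 33.0},
]
best = best_annotation_per_protein(hits)
for aa_id, hit in sorted(best.items()):
    print(aa_id, hit["pfam_name"], hit["pfam_score"])
```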

    MarFERReT.v1.1.1.dmnd: This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. This can be used as the reference DIAMOND database for annotating environment sequences from eukaryotic metatranscriptomes.

  8. Data from: Genomes To Fields (G2F) Inbred Ear Imaging Data 2017

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Genomes To Fields (G2F) Inbred Ear Imaging Data 2017 [Dataset]. https://catalog.data.gov/dataset/genomes-to-fields-g2f-inbred-ear-imaging-data-2017-079c0
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    A subset of ~30 inbreds were evaluated in 2014 and 2015 to develop an image based ear phenotyping tool. The data is stored in CyVerse. Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.

    Resources in this dataset:

    • Resource Title: CyVerse Genomes To Fields Inbred Ear Imaging 2017 dataset download. File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Edgar_Spalding_G2F_Inbred_Ear_Imaging_June_2017 Dataset (csv, tar.gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.

  9. Genomes To Fields 2014

    • datasets.ai
    • agdatacommons.nal.usda.gov
    • +1more
    21
    Updated Mar 30, 2024
    Cite
    Department of Agriculture (2024). Genomes To Fields 2014 [Dataset]. https://datasets.ai/datasets/genomes-to-fields-2014-d3326
    Available download formats: 21
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Department of Agriculture
    Description

    Phenotypic, genotypic, and environment data for the 2014 field season: The data is stored in CyVerse.

    Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.


    Resources in this dataset:

  10. Gene Expression V2

    • kaggle.com
    zip
    Updated Sep 25, 2024
    Cite
    willian oliveira (2024). Gene Expression V2 [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/gene-expression-v2/suggestions
    Explore at:
    zip (18128 bytes) Available download formats
    Dataset updated
    Sep 25, 2024
    Authors
    willian oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Gene Expression Omnibus (GEO) dataset GSE68086 provides crucial insights into cancer diagnostics by analyzing tumor-educated platelets (TEPs), offering a unique approach to non-invasive cancer detection across multiple cancer types. This dataset is centered on RNA-seq analysis, which focuses on the gene expression profiles of platelets from cancer patients. Tumor-educated platelets, which are altered by the presence of tumors, represent a promising biomarker for liquid biopsies, a method that allows for cancer detection without the need for invasive tissue sampling.

    The dataset titled "RNA-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics" focuses on Homo sapiens and utilizes expression profiling by high-throughput sequencing. It includes 283 samples of blood platelets, of which 228 are tumor-educated platelets from patients with six types of malignant tumors: non-small cell lung cancer, colorectal cancer, pancreatic cancer, glioblastoma, breast cancer, and hepatobiliary carcinomas. The remaining 55 samples are from healthy individuals, serving as control samples.

    The methodology for generating this dataset involved collecting blood samples using EDTA as an anticoagulant, isolating platelets, and extracting RNA using the mirVana RNA isolation kit. Following RNA extraction, cDNA synthesis and amplification were performed using the SMARTer Ultra Low RNA Kit, and sequencing was conducted using the Illumina HiSeq 2500 platform. Quality control was rigorously ensured by employing the Bioanalyzer 2100 system. Data processing steps involved the use of various bioinformatics tools, including Trimmomatic for quality control, STAR for mapping reads to the hg19 reference genome, Picard-tools for selecting intron-spanning reads, and HTseq for read summarization.

    The dataset's structure includes 285 columns representing samples (both TEP and healthy controls) and 57,736 rows corresponding to Ensembl gene IDs. The primary data format is intron-spanning read counts, and files available for download include both gzipped text files (such as GSE68086_TEP_data_matrix.txt.gz) and CSV files for easy access and manipulation. Detailed sample information is provided in the series matrix files, both in text and CSV formats.
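    The stated dimensions can be checked after download. The following is a minimal Python sketch, assuming the matrix file (for example GSE68086_TEP_data_matrix.txt.gz) is tab-delimited, has a single header row, and carries the Ensembl gene ID in the first column of every data row. None of these layout details are confirmed by the description above, and the function name is hypothetical.

```python
import csv
import gzip


def matrix_shape(path):
    """Return (n_genes, n_samples) for a gzipped, tab-delimited count matrix.

    Assumed layout (not confirmed by the dataset description):
    one header row, then one row per gene with the gene ID in column 1.
    """
    n_genes = 0
    n_samples = None
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row of sample names
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if n_samples is None:
                # every field after the gene ID is one sample's read count
                n_samples = len(row) - 1
            n_genes += 1
    return n_genes, n_samples or 0
```

    Calling matrix_shape on the downloaded file should return a (rows, columns) pair that can be compared with the row and column counts quoted above.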

    This dataset has several potential applications. It can be used to explore liquid biopsy techniques for non-invasive cancer diagnostics, identify cancer-specific biomarkers, and study cancer-induced changes in platelet RNA profiles. Researchers can perform comparative analyses across different cancer types and apply machine learning models for both binary classification (distinguishing between healthy individuals and cancer patients) and multiclass classification (differentiating between various cancer types). Molecular pathway analysis could also be employed to identify pathways specific to different cancers.

    The importance of this dataset lies in its potential to significantly advance cancer diagnostics by leveraging TEPs as biomarkers. This approach could enable early detection and more precise classification of cancers, offering a novel method of blood-based screening using gene expression profiles. The data can be accessed through the GEO platform under accession number GSE68086, and online analysis tools such as GEO2R and the GEOquery R package facilitate further analysis. This research was published by Best MG et al. in the Cancer Cell journal in 2015, where it was recognized for demonstrating the efficacy of tumor-educated platelets in pan-cancer diagnostics.

  11. Data from: Expressed Sequence Tags from the Ciliate Protozoan Parasite...

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    xls
    Updated Nov 22, 2025
    Cite
    Jason W. Abernathy; Peng Xu; Ping Li; De-Hai Xu; Huseyin Kucuktas; Phillip Klesius; Covadonga Arias; Zhanjiang Liu (2025). Expressed Sequence Tags from the Ciliate Protozoan Parasite Ichthyophthirius Multifiliis [Dataset]. http://doi.org/10.15482/USDA.ADC/1529231
    Explore at:
    xls Available download formats
    Dataset updated
    Nov 22, 2025
    Dataset provided by
    BMC Genomics
    Authors
    Jason W. Abernathy; Peng Xu; Ping Li; De-Hai Xu; Huseyin Kucuktas; Phillip Klesius; Covadonga Arias; Zhanjiang Liu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Researchers sequenced 10,368 expressed sequence tag (EST) clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences containing 976 contigs and 3,730 singletons. The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease' leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate ESTs for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs.

    Resources in this dataset:

    Resource Title: Data Dictionary - Supplemental Tables 1, 2, and 3.
    File Name: IchthyophthiriusESTs_DataDictionary.csv
    Resource Description: Machine-readable comma-separated values (CSV) definitions for data elements of Supplemental Tables 1-3 concerning I. multifiliis unique EST sequences, BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes, and gene ontology (GO) profile.

    Resource Title: Table 3. Table of gene ontology (GO) profiles.
    File Name: 12864_2006_889_MOESM3_ESM.xls
    Resource Description: Supplemental Table 3, Excel spreadsheet; Table of gene ontology (GO) profiles; Provided information includes unique EST name, accession numbers, BLASTX top hit, GO identification numbers and enzyme commission (EC) numbers.
    Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176
    Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM3_ESM.xls

    Resource Title: Table I. Multifiliis unique EST sequences.
    File Name: 12864_2006_889_MOESM1_ESM.xls
    Resource Description: Supplemental Table 1 for article, "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; Table of I. multifiliis unique EST sequences; Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name and accession numbers. Also included are significant protein domain comparisons to the Swiss-Prot database. Putative secretory proteins are highlighted.
    Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176
    Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM1_ESM.xls

    Resource Title: Table 2. Excel spreadsheet; Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes.
    File Name: 12864_2006_889_MOESM2_ESM.xls
    Resource Description: Table 2 from "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name, tBLASTx top hits to the T. thermophila genome, and BLASTX top hits to the P. falciparum genome sequences. This table correlates with the Venn diagram in figure 1.
    Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176
    Direct download link for this data resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM2_ESM.xls
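    The sequencing counts quoted in this description are internally consistent, which a few lines of Python can confirm. This is only an arithmetic check on the published figures; the function and variable names are illustrative and not part of the dataset.

```python
def est_summary(clones, sequences, contigs, singletons):
    """Return (success rate in percent, number of unique sequences)."""
    success_rate = 100 * sequences / clones   # share of clones that yielded a sequence
    unique_sequences = contigs + singletons   # contigs and singletons together
    return success_rate, unique_sequences


rate, unique = est_summary(clones=10_368, sequences=9_769,
                           contigs=976, singletons=3_730)
print(f"{rate:.1f}%")  # 94.2%
print(unique)          # 4706
```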

  12. Table_5_An integrative analysis of single-cell and bulk transcriptome and...

    • frontiersin.figshare.com
    txt
    Updated Dec 21, 2023
    + more versions
    Cite
    Hong-Kai Cui; Chao-Jie Tang; Yu Gao; Zi-Ang Li; Jian Zhang; Yong-Dong Li (2023). Table_5_An integrative analysis of single-cell and bulk transcriptome and bidirectional mendelian randomization analysis identified C1Q as a novel stimulated risk gene for Atherosclerosis.csv [Dataset]. http://doi.org/10.3389/fimmu.2023.1289223.s003
    Explore at:
    txt Available download formats
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Frontiers Media http://www.frontiersin.org/
    Authors
    Hong-Kai Cui; Chao-Jie Tang; Yu Gao; Zi-Ang Li; Jian Zhang; Yong-Dong Li
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The role of complement component 1q (C1Q) related genes in human atherosclerotic plaques (HAP) is poorly understood. Our aim is to establish C1Q associated hub genes using single-cell RNA sequencing (scRNA-seq) and bulk RNA analysis to diagnose and predict HAP patients more effectively and investigate the association between C1Q and HAP (ischemic stroke) using bidirectional Mendelian randomization (MR) analysis.

    Methods: HAP scRNA-seq and bulk-RNA data were downloaded from the Gene Expression Omnibus (GEO) database. The C1Q-related hub genes were screened using the GBM, LASSO and XGBoost algorithms. We built machine learning models to diagnose and distinguish between types of atherosclerosis using generalized linear models and receiver operating characteristics (ROC) analyses. Further, we scored the HALLMARK_COMPLEMENT signaling pathway using ssGSEA and confirmed hub gene expression through qRT-PCR in RAW264.7 macrophages and apoE-/- mice. Furthermore, the risk association between C1Q and HAP was assessed through bidirectional MR analysis, with C1Q as exposure and ischemic stroke (IS, large artery atherosclerosis) as outcomes. Inverse variance weighting (IVW) was used as the main method.

    Results: We utilized scRNA-seq dataset (GSE159677) to identify 24 cell clusters and 12 cell types, and revealed seven C1Q associated DEGs in both the scRNA-seq and GEO datasets. We then used GBM, LASSO and XGBoost to select C1QA and C1QC from the seven DEGs. Our findings indicated that both training and validation cohorts had satisfactory diagnostic accuracy for identifying patients with HAP. Additionally, we confirmed SPI1 as a potential TF responsible for regulating the two hub genes in HAP. Our analysis further revealed that the HALLMARK_COMPLEMENT signaling pathway was correlated and activated with C1QA and C1QC. We confirmed high expression levels of C1QA, C1QC and SPI1 in ox-LDL-treated RAW264.7 macrophages and apoE-/- mice using qPCR. The results of MR indicated that there was a positive association between the genetic risk of C1Q and IS, as evidenced by an odds ratio (OR) of 1.118 (95%CI: 1.013–1.234, P = 0.027).

    Conclusion: The authors have developed and validated a novel diagnostic signature comprising two genes for HAP, while MR analysis has provided evidence supporting a positive association between C1Q and IS.
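    As a sanity check on the reported MR result, the P value can be approximately recovered from the odds ratio and its confidence interval. The sketch below assumes a Wald-type interval that is symmetric on the log-odds scale and a two-sided normal test. The abstract does not state how the interval was computed, so this is an illustration and not the authors' procedure.

```python
import math


def wald_p_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover (standard error, z, two-sided p) from an OR and its 95% CI.

    Assumes the interval is exp(log(OR) +/- z_crit * SE), which the
    source does not confirm.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


se, z, p = wald_p_from_ci(1.118, 1.013, 1.234)
print(f"SE={se:.4f}  z={z:.2f}  p={p:.3f}")
```

    Under these assumptions the recovered P value lands close to the reported 0.027, so the three published numbers agree with one another.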

  13. Data from: Quantifying the phenome-wide response to sex-specific selection...

    • search.dataone.org
    • datadryad.org
    Updated Feb 26, 2025
    Cite
    Thomas Keaney; Luke Holman (2025). Quantifying the phenome-wide response to sex-specific selection in Drosophila melanogaster [Dataset]. http://doi.org/10.5061/dryad.2v6wwpzzp
    Explore at:
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Thomas Keaney; Luke Holman
    Description

    In species with separate sexes, selection on males causes evolutionary change in female trait values (and vice versa) via genetic correlations, which has far-reaching consequences for adaptation. Here, we utilise a sex-specific form of Robertson’s Secondary Theorem of Natural Selection to estimate the expected response to selection for 474 organismal-level traits and ~28,000 gene expression traits measured in the Drosophila Genetic Reference Panel (DGRP). Across organismal-level traits, selection acting on males produced a larger predicted evolutionary response than did selection acting on females, even for female traits; while for transcriptome traits selection on each sex produced a roughly equal average evolutionary response. For most traits, selection on males and females was predicted to move average trait values in the same direction, though for some traits, selection on one sex increased trait values while selection on the other sex decreased them, implying intralocus sexual con...

    We performed a forward citation search of Google Scholar for articles that had cited the original DGRP paper (Mackay et al., 2012; this study introduced the resource and is the common citation among those that use it) as of January 2022, to obtain line mean estimates and associated meta-data for quantitative traits that have been measured in the DGRP (Figure 1). We supplemented our search by including all articles cited in an influential review of the DGRP (Mackay & Huang, 2018) and all datasets included on the DGRP2 web application (http://dgrp2.gnets.ncsu.edu/; Mackay et al., 2012; Huang et al., 2014). In total, we identified 126 studies that reported line means or raw data for 38,411 phenotypic traits. 19,259 of these were female traits and 19,081 male traits (the remaining 71 were estimated from mixed sex groups). 36,280 of these were reported by a single study, which measured whole-body expression level for many of the known genes in D. melanogaster (Huang et al., 2015). We res...

    # Data from: Quantifying the phenome-wide response to sex-specific selection in Drosophila melanogaster

    https://doi.org/10.5061/dryad.2v6wwpzzp

    Description of the data and file structure

    Data concern a vast range of quantitative traits measured across the Drosophila genetic reference panel. With these data you can run all analyses presented in the associated manuscript.

    If you would like to use the dataset for your own analysis, download meta_data_for_all_traits.csv and either all.dgrp.phenos_unscaled.csv or all.dgrp.phenos_scaled.csv depending on whether you want to use traits expressed on their original scale or on a standardised scale (mean = 0, SD = 1). Read this to see how we collated the dataset and conducted quality control. Read this to see how we further clea...
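    The difference between the unscaled and scaled files is a per-trait standardisation to mean 0 and SD 1. A minimal Python sketch of that transformation follows. It assumes the sample standard deviation (n - 1 denominator) and that missing line means are skipped; the description does not say which SD the authors used, and the function name is hypothetical.

```python
import statistics


def standardise(values):
    """Scale one trait's line means to mean 0 and SD 1, keeping None for missing values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)  # sample SD; an assumption, see lead-in
    return [None if v is None else (v - mean) / sd for v in values]


print(standardise([2.0, 4.0, None, 6.0]))  # [-1.0, 0.0, None, 1.0]
```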

  14. Escherichia coli metadata files obtained from isolates listed in Enterobase...

    • figshare.com
    txt
    Updated Jul 1, 2024
    Cite
    Albert Schriefer (2024). Escherichia coli metadata files obtained from isolates listed in Enterobase and PATRIC databases. [Dataset]. http://doi.org/10.6084/m9.figshare.25796323.v3
    Explore at:
    txt Available download formats
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Albert Schriefer
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SampleMiner (SM) is a Python script that helps researchers dependent on limited internet resources to locally perform microbial isolates sampling, metadata retrieval and whole-genome sequences (WGS) download, provided the metadata pages of the online database can be downloaded as comma-separated value (csv) files. The metadata tab-separated value (tsv) files listed in this Figshare item were used during the development of SM code. These files contain all E. coli isolates metadata obtained from Enterobase on 18/11/2021 and from PATRIC on 29/9/2021, using Escherichia or Shigella as combined keywords. The originally downloaded csv files were converted into the tsv files.
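    The csv-to-tsv conversion mentioned above can be done with Python's standard csv module. This is a generic sketch and not SampleMiner's own code; the function name and the sample row are hypothetical.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy rows from a CSV text stream to a tab-separated text stream.

    Returns the number of rows written. Quoted commas inside a field
    survive because the csv reader parses the quoting before rewriting.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    n_rows = 0
    for row in reader:
        writer.writerow(row)
        n_rows += 1
    return n_rows


source = io.StringIO('name,source\n"Escherichia, sp.",Enterobase\n')
target = io.StringIO()
csv_to_tsv(source, target)
print(target.getvalue())
```

    For real files, open both paths with newline="" as the csv module documentation recommends. A field that itself contains a tab will be quoted in the output.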

  15. Relevant datasets for sgRNA library characterization tasks

    • data.niaid.nih.gov
    • resodate.org
    Updated Aug 27, 2023
    Cite
    Vivek Das (2023). Relevant datasets for sgRNA library characterization tasks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8285482
    Explore at:
    Dataset updated
    Aug 27, 2023
    Dataset provided by
    Novo Nordisk A/S
    Authors
    Vivek Das
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This contains all the relevant datasets that were downloaded, used and generated to execute the sgRNA library characterization in the notebook sgRNA_Ryvu.ipynb.

    This includes the following files:

    The hg38 fasta file in chunks

    The necessary indexes built with Bowtie 2

    The relevant mapped sam and bam files that are indexed and sorted

    The hg38 gtf file for gene annotation

    Relevant GDCquery download file for BRCA

    Final gene expression matrix file in CSV format.

  16. nf-core/metatdenovo taxonomy

    • researchdata.se
    Updated Feb 24, 2025
    + more versions
    Cite
    Daniel Lundin (2025). nf-core/metatdenovo taxonomy [Dataset]. http://doi.org/10.17044/SCILIFELAB.28211678
    Explore at:
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Linnaeus University
    Authors
    Daniel Lundin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data in this repository can be used to assign taxonomy to sequences with Diamond [Buchfink et al. 2015], particularly using the --diamond_dbs parameter in nf-core/metatdenovo (https://nf-co.re/metatdenovo), release 1.1 or later. Currently, the data available represents species-representative genomes from the Genome Taxonomy Database (GTDB), release R09-RS220 [Parks et al. 2018].

    File preparation

    All species-representative genomes from GTDB were downloaded from the National Center for Biotechnology Information (NCBI) and annotated with Prokka [v. 1.14.6; Seemann 2014], and the sequences for all resulting proteins were used for this data. The taxonomy dump files (in NCBI taxonomy dump format) were created from the GTDB metadata with TaxonKit [v. 0.18.0; Shen and Ren 2021] and the Diamond database with Diamond [v. 2.1.10; Buchfink et al. 2015] in "taxonomy mode", i.e. using the taxonomy dump created with TaxonKit. (See below for commands used.)

    File descriptions

    There are five files:

    • gtdb-r220.faa.gz: Fasta file with protein sequences. Not used by nf-core/metatdenovo but can be used to create the Diamond database below.
    • gtdb-r220.taxonomy.dmnd: Diamond database with taxonomy information.
    • gtdb-r220.names.dmp: Taxonomy dump file.
    • gtdb-r220.nodes.dmp: Nodes dump file.
    • gtdb-r220.seqid2taxid.tsv.gz: Mapping from protein accession to taxon.

    The Diamond database and taxonomy dump files can be used with nf-core/metatdenovo (Version >1.1) by providing a csv file like below to the --diamond_dbs parameter. (Although Nextflow can use https-urls for paths, it is usually better to download the very large files and keep local copies.)

    db,dmnd_path,taxdump_names,taxdump_nodes,ranks,parse_with_taxdump

    gtdb,gtdb_r220_repr.dmnd,gtdb_taxdump/names.dmp,gtdb_taxdump/nodes.dmp,domain;phylum;class;order;genus;species;strain,

    Commands used to prepare taxonomy dump files and the Diamond database

    • Taxonomy dump: cut -f 1,19-20 metadata.tsv | grep -v 'accession' | awk 'BEGIN { FS="\t" } { if ( $2 == "t" ) { print $1 "\t" $3 } }' | taxonkit create-taxdump --gtdb -O .
    • Diamond database: gunzip -c gtdb-r220.faa.gz | sed '/^>/s/ .//' | diamond makedb --taxonmap gtdb-r220.seqid2taxid.tsv.gz --taxonnames gtdb-r220.names.dmp --taxonnodes gtdb-r220.nodes.dmp --db gtdb-r220.taxonomy.dmnd --no-parse-seqids

    Revision history

    20250211 First version
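    The filtering stage of the taxonomy dump command (the cut, grep and awk steps before taxonkit) can be read as: take columns 1, 19 and 20 of metadata.tsv, drop the header line, and keep the first and third of those columns wherever the second equals "t". A Python sketch of the same logic follows. The meaning of columns 19 and 20 (presumably a representative flag and the taxonomy string) is an assumption, the function name is hypothetical, and unlike cut this version silently skips rows with fewer than 20 fields.

```python
import csv


def representative_taxonomy(lines):
    """Yield (accession, taxonomy) pairs the way the cut | grep | awk steps do.

    `lines` is any iterable of tab-separated text lines, such as an open file.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 20:
            continue  # deviation from `cut`: short rows are ignored
        # cut -f 1,19-20 keeps the 1st, 19th and 20th tab-separated fields
        accession, flag, taxonomy = fields[0], fields[18], fields[19]
        # grep -v 'accession' drops any line containing that word (the header)
        if "accession" in "\t".join((accession, flag, taxonomy)):
            continue
        # awk: print columns 1 and 3 only when column 2 is exactly "t"
        if flag == "t":
            yield accession, taxonomy
```

    Writing each pair as a tab-separated line reproduces the stream that the original command pipes into taxonkit create-taxdump.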

  17. Pan-cancer Aberrant Pathway Activity Analysis (PAPAA)

    • zenodo.org
    application/gzip, csv +1
    Updated Dec 5, 2020
    Cite
    DANIEL BLANKENBERG; DANIEL BLANKENBERG; VIJAY NAGAMPALLI; VIJAY NAGAMPALLI (2020). Pan-cancer Aberrant Pathway Activity Analysis (PAPAA) [Dataset]. http://doi.org/10.5281/zenodo.3625201
    Explore at:
    tsv, application/gzip, csv Available download formats
    Dataset updated
    Dec 5, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    DANIEL BLANKENBERG; DANIEL BLANKENBERG; VIJAY NAGAMPALLI; VIJAY NAGAMPALLI
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information about the dataset files:

    1) pancan_rnaseq_freeze.tsv.gz: Publicly available gene expression data for the TCGA Pan-cancer dataset. File: PanCanAtlas EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611] [https://doi.org/10.1016/j.celrep.2018.03.046]

    2) pancan_mutation_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset. File: mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]

    3) pancan_GISTIC_threshold.tsv.gz: Publicly available gene-level copy number information of the TCGA Pan-cancer dataset. This file is processed using script process_copynumber.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. The files copy_number_loss_status.tsv.gz and copy_number_gain_status.tsv.gz generated from this data are used as inputs in our Galaxy pipeline. [https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443] [https://doi.org/10.1016/j.celrep.2018.03.046]

    4) mutation_burden_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/][http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]

    5) sample_freeze.tsv or sample_freeze_version4_modify.tsv: The file lists the frozen samples as determined by TCGA PanCancer Atlas consortium along with raw RNAseq and mutation data. These were previously determined and included for all downstream analysis. All other datasets were processed and subset according to the frozen samples. [https://github.com/greenelab/pancancer/]

    6) vogelstein_cancergenes.tsv: compendium of OG and TSG used for the analysis. [https://github.com/greenelab/pancancer/]

    7) CCLE_DepMap_18Q1_maf_20180207.txt.gz Publicly available Mutational data for CCLE cell lines from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2FCCLE_DepMap_18Q1_maf_20180207.txt]

    8) ccle_rnaseq_genes_rpkm_20180929.gct.gz: Publicly available Expression data for 1019 cell lines (RPKM) from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2Fccle_2019%2FCCLE_RNAseq_genes_rpkm_20180929.gct.gz]

    9) CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct: Publicly available merged Mutational and copy number alterations that include gene amplifications and deletions for the CCLE cell lines. This data is represented in the binary format and provided by the Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://data.broadinstitute.org/ccle_legacy_data/binary_calls_for_copy_number_and_mutation_data/CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct]

    10) GDSC_cell_lines_EXP_CCLE_names.csv.gz Publicly available RMA normalized expression data for Genomics of Drug Sensitivity in Cancer(GDSC) cell-lines. File gdsc_cell_line_RMA_proc_basalExp.csv was downloaded. This data was subsetted to 389 cell lines that are common among CCLE and GDSC. All the GDSC cell line names were replaced with CCLE cell line names for further processing. [https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip]

    11) GDSC_CCLE_common_mut_cnv_binary.csv.gz: A subset of merged Mutational and copy number alterations that include gene amplifications and deletions for common cell lines between GDSC and CCLE. This file is generated using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct and a list of common cell lines.

    12) gdsc1_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC1 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_15Oct19.xlsx]

    13) gdsc2_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC2 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC2_fitted_dose_response_15Oct19.xlsx]

    14) compounds.csv: list of pharmacological compounds tested for our analysis

    15) tcga_dictonary.tsv: list of cancer types used in the analysis.

    16) seg_based_scores.tsv: Measurement of total copy number burden, Percent of genome altered by copy number alterations. This file was used as part of the Pancancer analysis by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/]

  18. Simulating population divergence of Northern chamois in the Alps based on...

    • cmr.earthdata.nasa.gov
    • envidat.ch
    Updated Sep 27, 2023
    Cite
    (2023). Simulating population divergence of Northern chamois in the Alps based on habitat dynamics [Dataset]. http://doi.org/10.16904/envidat.291
    Explore at:
    Dataset updated
    Sep 27, 2023
    Time period covered
    Jan 1, 2022
    Area covered
    Description

    General description

    Genomic data, habitat suitability raster files and scripts to run gen3sis to simulate cumulative divergence over time as approximation for genetic differentiation. Scripts for basic analysis of the simulations (e.g., create distance matrix from sampling locations) are provided, too. See original publication (doi link will be provided after publication) for details. The study area is the European Alps. All data is uploaded as zipped files. Unzip them after the download and put all data in one folder. See linked publications for correct citation of the data used; use of the data without correct citation is not allowed. Corresponding author: Flurin Leugger, email: flurin.leugger@gmail.com

    # Description of the data (content of the different zip folders)

    ## Abiotic data

    ### Glaciers

    Folders with raster stacks with glaciated areas at 0.05° resolution in WGS84 projection from Seguinot et al. (2018).
    Seguinot, J., Ivy-Ochs, S., Jouvet, G., Huss, M., Funk, M., & Preusser, F. (2018). Modelling last glacial cycle ice dynamics in the Alps. The Cryosphere, 12(10), 3265–3285. https://doi.org/10.5194/tc-12-3265-2018

    ### Rivers

    * river_raster_elevation_class.tif: raster file (.tif) at 0.05° resolution and WGS84 projection with large rivers (scenario 2 from publication). Each river cell is classified according to the elevation of the cell. Natural Earth. (2018). Rivers + lake centerlines version 4.1.0. Retrieved January 22, 2020, from https://www.naturalearthdata.com/downloads/50m-physical-vectors/50m-rivers-lake-centerlines
    * river_raster_strahler_class_5km.tif: raster file at 0.05° resolution and WGS84 projection with medium rivers. The rivers are classified according to their Strahler order. Food and Agriculture Organization of the United Nations. (2014). Rivers in Europe (Derived from HydroSHEDS). Retrieved January 29, 2020, from http://www.fao.org/geonetwork/srv/fr/google.kml?uuid=e0243940-e5d9-487c-8102-45180cf1a99f&layers=AQUAMAPS:37253_rivers_europe

    ## Fossil records

    * chamois_fossil_combined_public.xlsx: list with fossil records until 20,000 years BP from Central Europe, see linked references for citation.

    ## Chamois occurrences

    * chamois_occurrence.csv: Chamois presences from all sources used for the publication (see Suppl. mat. Table S1 for detailed information and correct citations of the data) aggregated at 0.05° resolution (~5km).

    ## Gen3sis

    * config: folders with all configuration files used to run the simulations for the publication (different dispersal divergence parameters).
    * scripts: scripts (and helper functions) to run the gen3sis simulations including scripts for the beginning of the subsequent analysis.

    ## Genetic

    * populations.snps.light.vcf: vcf file of the sampled Northern chamois (Rupicapra rupicapra). The genomic data encompasses 20k SNPs (from ddRAD sequencing).
    * Sequencing_final_without_slovakia.txt: sampling locations of Northern chamois (Rupicapra rupicapra)

    ## HSM

    * habitat_suitability_hindcasting: Aggregated habitat suitability raster files (stacks, .grd files) at 0.05° resolution and WGS84 projection from 20,000 years BP until today in 100 year time steps. There are separate folders for each environmental variable scenario used (different terrain slope variables) and the different occurrence/pseudo-absence sampling strategy used.
    * ODMAP_LeuggerEtAl2021-10-25.csv_: ODMAP protocol

  19. Salmonella enterica pangenome graph and variant call data for 539,283...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Jul 11, 2025
    Cite
    Agricultural Research Service (2025). Salmonella enterica pangenome graph and variant call data for 539,283 genomes [Dataset]. https://catalog.data.gov/dataset/isalmonella-enterica-ipangenome-graph-and-variant-call-data-for-539283-genomes
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Salmonella pangenome graph and variant call data for 539,283 genomes

    Description: Salmonella enterica causes human disease and decreases agricultural production. The overall goal of this project is to generate a large database of S. enterica variants with 539,283 samples and 236,069 features for applications in machine learning and genomics. We transformed single nucleotide polymorphism (SNP) data into reduced-dimensional representations that are tolerant of missing data, based on disentangled variational autoencoders. TFRecord files were made with custom Python scripts that parsed the variant call format (VCF) files into sparse tensors and combined them with the Salmonella In Silico Typing Resource (SISTR) serotype data.

    The data directory contains:

    • The tar file of TFRecords: tfrecords.tar (103 GB). The TFRecords are organized first by how they were genotyped: the mpileup records were created with Mpileup, and the gvg records were created with graph variant calling. Each of these directories contains batches of ~10,000 sequence reads named Sra10k_XX.tfrecord.gz (00--54). File Sra10k_99.tfrecord.gz contains incomplete SRAs. Each TFRecord contains the shape of the tensor, the indices of non-zero variants, sample name, serotype, and sparse values. Value 99 was assigned to '.' records.

    • The file output.tar (11.4 TB) contains the .vcf files used to create the TFRecords above. The same data is contained more succinctly in the TFRecord format, so this file will not normally be used.

    • A tar file of metadata files for the samples: metadata (95 MB). Sequence Read Archive (SRA) accessions were downloaded using edirect/eutilities and saved as SraAccList.txt:

      esearch -db sra -query "txid28901[Organism:exp] AND (cluster_public[prop] AND 'biomol dna'[Properties] AND 'library layout paired'[Properties] AND 'platform illumina'[Properties] AND 'strategy wgs'[Properties] OR 'strategy wga'[Properties] OR 'strategy wcs'[Properties] OR 'strategy clone'[Properties] OR 'strategy finishing'[Properties] OR 'strategy validation'[Properties])" | efetch -format runinfo -mode xml | xtract -pattern Row -element Run > SraAccList.txt

      Google BigQuery was used to download metadata for the SRA accessions from the National Institutes of Health (NIH):

      SELECT * FROM nih-sra-datastore.sra.metadata as metadata INNER JOIN {table_id} as leiacc ON metadata.acc = leiacc.accID;

      Files were processed into batches of ~10,000 and named Sra_completed_XX.csv (00--53).

    • A VCF document mapping the TFRecord data to the positions in the graph subjected to the Type strain LT2: mapping/DRR452337.gvg.vcf-with_TFRecord_in_1st_column.txt

    • Scripts for creating and reading TFRecord data: code.
      • reading_and_parsing_fns.py defines functions for converting VCFs of variants called using gvg to sparse tensors and makes the TFRecord files.
      • gvg_to_tfrecord.py creates TFRecords from the sparse tensors.
      • Tutorial for using the TFRecords: Example_logistic_regression.md

    • Pangenome graph files and references used for variant calling and genotyping: pangenome.
      • refPlus100.fasta.gz: the genomes of the 101 Salmonella strains, without plasmids, used for construction of the pangenome graph.
      • salm.100.NC_003197_v2.d2_complete.gfa.gz: the complete 101 Salmonella strain pangenome graph in Graphical Fragment Assembly (GFA2) Format 2.0, including alt nodes used for genotyping.
      • salm.100.NC_003197_v2.full.gfa.gz: the full graph including alt nodes.
      • salm.100.NC_003197_v2.full.vcf.gz: a VCF of the file above.
      • genotyped.gvg.vcf: the genotype calls in VCF format.
      • paths.txt: the paths of the graph.

    SCINet users: The data folder can be accessed/retrieved with a valid SCINet account at this location: /LTS/ADCdatastorage/NAL/published/node28083194/ See the SCINet File Transfer guide for more information on moving large files: https://scinet.usda.gov/guides/data/datatransfer

    Globus users: The files can also be accessed through Globus by following this data link. The user will need to log in to Globus in order to access this data. User accounts are free of charge, with several options for signing on. Instructions for creating an account are on the login page.
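The description says each TFRecord stores a tensor shape, the indices of non-zero variants, and the sparse values, with 99 marking '.' (missing) calls. The sketch below shows how such a sparse record could be expanded into a dense vector once it has been decoded. It assumes a one-dimensional tensor per sample; the dictionary keys, the sample name, and the toy values are hypothetical, and the real feature names should be taken from the dataset's own reading_and_parsing_fns.py.

```python
MISSING = 99  # per the description, 99 was assigned to '.' records


def densify(shape, indices, values, fill=0):
    """Expand a sparse 1-D record into a dense list of genotype codes."""
    (n_features,) = shape  # assumes a one-dimensional tensor
    dense = [fill] * n_features
    for idx, val in zip(indices, values):
        dense[idx] = val
    return dense


def missing_positions(dense):
    """Return the positions whose call was missing ('.' in the VCF)."""
    return [i for i, v in enumerate(dense) if v == MISSING]


# Hypothetical, already-decoded record; keys and values are illustrative only.
record = {
    "sample": "SAMPLE_PLACEHOLDER",
    "serotype": "Typhimurium",
    "shape": (10,),
    "indices": [1, 4, 7],
    "values": [1, 99, 1],
}

dense = densify(record["shape"], record["indices"], record["values"])
print(dense)
print("missing at:", missing_positions(dense))
```

Keeping the missing-value code separate from true genotype values matters downstream, since treating 99 as a genotype would distort any model trained on the dense vectors.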

  20. SNP dataset for GWAS

    • kaggle.com
    zip
    Updated Feb 27, 2023
    Cite
    Piotr Szulc (2023). SNP dataset for GWAS [Dataset]. https://www.kaggle.com/datasets/seascape/snp-dataset-for-gwas
    Explore at:
    Available download formats: zip (143503906 bytes)
    Dataset updated
    Feb 27, 2023
    Authors
    Piotr Szulc
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data includes genotypes of 482,906 markers for 1,000 individuals. They come from a simulation based on the Illumina 650K human array, typically used for SNP genotyping.

    In theory, such data is easy to create, since it is just columns with values of 0, 1 and 2. What matters is the correlation structure, which has been preserved here and corresponds to the real one.

    The data can be used to test methods for finding significant SNPs. You can generate a trait based on the significant variables of your choice, and then try to find them using the chosen technique (which is not easy, due to the huge number of variables).

    The y.txt file contains the trait I simulated based on the following list of 24 SNPs:

    • ch01_19810
    • ch01_27796
    • ch01_32763
    • ch02_22034
    • ch02_39189
    • ch03_2703
    • ch03_10846
    • ch04_05127
    • ch05_7371
    • ch06_25838
    • ch08_15190
    • ch10_444
    • ch10_8265
    • ch11_12611
    • ch11_20057
    • ch12_3421
    • ch14_6999
    • ch15_3859
    • ch16_4525
    • ch17_4306
    • ch18_1031
    • ch19_1377
    • ch19_6378
    • ch22_33

    See which ones you can find!
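The workflow described above (pick causal SNPs, generate a trait, then try to recover them) can be sketched on a toy scale as follows. Everything here is an assumption for illustration: the genotypes are drawn independently at random, so they lack the correlation structure that makes the real dataset hard, and the marker indices, effect sizes, and noise level are invented. The ranking uses a plain single-marker Pearson correlation, which is only one of many possible techniques.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
n_individuals, n_markers = 200, 50
causal = {3: 2.0, 17: 2.0}  # hypothetical marker index -> effect size

# genotypes[j][i] is the genotype (0, 1 or 2) of individual i at marker j.
genotypes = [
    [rng.choice((0, 1, 2)) for _ in range(n_individuals)]
    for _ in range(n_markers)
]

# Trait = additive effects of the causal markers plus Gaussian noise.
trait = [
    sum(effect * genotypes[j][i] for j, effect in causal.items())
    + rng.gauss(0, 0.5)
    for i in range(n_individuals)
]

# Rank markers by absolute correlation with the trait.
scores = {j: abs(pearson(genotypes[j], trait)) for j in range(n_markers)}
top = sorted(scores, key=scores.get, reverse=True)[: len(causal)]
print("top-ranked markers:", sorted(top))
```

With two strong, independent effects the causal markers are easy to recover. On the real data, correlated neighbouring markers and 24 effects among hundreds of thousands of candidates make the same task far harder.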

Cite
Tuhin Rana (2023). RNA-seq example data [Dataset]. https://www.kaggle.com/datasets/rana2hin/rna-seq-example-data

RNA-seq example data

expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

Explore at:
9 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (2193914798 bytes)
Dataset updated
Jun 16, 2023
Authors
Tuhin Rana
License

https://cdla.io/sharing-1-0/

Description

Dataset Description

This dataset contains RNA-seq data from human cells. The data was collected using the Illumina HiSeq 2500 platform. The data includes raw sequencing reads, gene annotations, and phenotypic data for the samples.

Files and Folders

Files can be downloaded using the following command:

wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

Once the file has been downloaded, it can be extracted using the following command:

tar xvzf chrX_data.tar.gz

This will create a directory called chrX_data containing the following files:

genes/chrX.gtf
genome/chrX.fa
geuvadis_phenodata.csv
indexes/
mergelist.txt
samples/
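After extraction, it can be useful to confirm that the archive unpacked into the layout listed above. The following is a minimal sketch that checks for those six entries; the function name and the default location `chrX_data` in the current directory are assumptions.

```python
from pathlib import Path

# Entries the description says the archive contains.
EXPECTED = [
    "genes/chrX.gtf",
    "genome/chrX.fa",
    "geuvadis_phenodata.csv",
    "indexes",
    "mergelist.txt",
    "samples",
]


def missing_entries(root):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("chrX_data")
    print("missing:", missing if missing else "none")
```

An empty result means every listed file and directory is present; it does not verify the contents of the files.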

Here are some additional details about the files in the chrX_data directory:

  • genes/chrX.gtf - This file contains gene annotations for the human X chromosome. It is in the GTF format, which is a standard format for gene annotations. The GTF file contains information about the start and end positions of genes, as well as their transcripts.
  • genome/chrX.fa - This file contains the reference genome sequence for the human X chromosome. It is in the FASTA format, which is a standard format for storing DNA sequences.
  • geuvadis_phenodata.csv - This file contains phenotypic data for the samples in the dataset. The phenotypic data includes the sample ID, sex, and population of each sample.
  • indexes/ - This directory contains index files for HISAT2. Index files are used to speed up the alignment of sequencing reads to a reference genome.
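  • mergelist.txt - This file lists the files to be merged. In the HISAT, StringTie and Ballgown workflow that this dataset accompanies, it names the per-sample transcript assemblies (GTF files) that StringTie's merge mode combines into a single set of transcripts; it is not a list of raw read files.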
  • samples/ - This directory contains the raw sequencing data. The raw sequencing data is in the FASTQ format, which is a standard format for storing sequencing reads.
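To make the GTF description above concrete, here is a minimal sketch of parsing one GTF line into its nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes). The example line and its identifiers are made up, and the attribute parsing is simplified: it assumes no semicolons appear inside quoted values.

```python
GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
)


def parse_gtf_line(line):
    """Parse a single GTF feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])

    # Attributes look like: key "value"; key "value";
    attrs = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    record["attributes"] = attrs
    return record


# Made-up example line; not copied from chrX.gtf.
example = 'chrX\texample\ttranscript\t1000\t2000\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";'
rec = parse_gtf_line(example)
print(rec["feature"], rec["start"], rec["end"], rec["attributes"])
```

Real GTF files also contain comment lines starting with `#`, which a full reader would skip before calling a parser like this.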

Usage

This dataset can be used to perform RNA-seq analysis using a variety of tools, such as HISAT2, StringTie, and Ballgown.

Here are some examples of how this dataset can be used:

  • To identify differentially expressed genes between two groups of samples.
  • To build a gene expression atlas for a particular tissue or cell type.
  • To study the expression of genes involved in a particular disease.
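As a very rough illustration of the first use case, the sketch below computes a log2 fold change in mean expression between two groups. It is not a statistical test and is not how Ballgown works internally; the gene names, expression values, and pseudocount are invented for the example.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression (group_b over group_a).

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((mean(group_b) + pseudocount) / (mean(group_a) + pseudocount))


# Invented expression values per gene: (group A samples, group B samples).
expression = {
    "GENE1": ([2.0, 3.0, 4.0], [14.0, 15.0, 16.0]),
    "GENE2": ([7.0, 7.0, 7.0], [7.0, 7.0, 7.0]),
}

lfc = {gene: log2_fold_change(a, b) for gene, (a, b) in expression.items()}
for gene, value in lfc.items():
    print(gene, round(value, 3))
```

A real analysis would also model variability between samples and correct for multiple testing before calling a gene differentially expressed.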

source: ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
