23 datasets found

RNA-seq example data
kaggle.com
zip
Updated Jun 16, 2023
Cite
Tuhin Rana (2023). RNA-seq example data [Dataset]. https://www.kaggle.com/datasets/rana2hin/rna-seq-example-data
Explore at:
Available download formats: zip (2193914798 bytes)
Dataset updated
Jun 16, 2023
Authors
Tuhin Rana
License
https://cdla.io/sharing-1-0/
Description
Dataset Description

This dataset contains RNA-seq data from human cells. The data was collected using the Illumina HiSeq 2500 platform. The data includes raw sequencing reads, gene annotations, and phenotypic data for the samples.

Files and Folders

Files can be downloaded using the following command:

wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

Once the file has been downloaded, it can be extracted using the following command:

tar xvzf chrX_data.tar.gz

This will create a directory called chrX_data containing the following files:

genes/chrX.gtf genome/chrX.fa geuvadis_phenodata.csv indexes/ mergelist.txt samples/

Here are some additional details about the files in the chrX_data directory:

genes/chrX.gtf - This file contains gene annotations for the human X chromosome. It is in the GTF format, which is a standard format for gene annotations. The GTF file contains information about the start and end positions of genes, as well as their transcripts.

genome/chrX.fa - This file contains the reference genome sequence for the human X chromosome. It is in the FASTA format, which is a standard format for storing DNA sequences.

geuvadis_phenodata.csv - This file contains phenotypic data for the samples in the dataset. In the source protocol's copy of this file, the recorded fields are the sample ID, sex, and population of each sample.

indexes/ - This directory contains index files for HISAT2. Index files are used to speed up the alignment of sequencing reads to a reference genome.

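mergelist.txt - This file lists the per-sample transcript assembly (GTF) files to be merged. In the HISAT2, StringTie, and Ballgown workflow that this data accompanies, the list is passed to StringTie's merge mode to build a single set of transcripts across all samples.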

samples/ - This directory contains the raw sequencing data. The raw sequencing data is in the FASTQ format, which is a standard format for storing sequencing reads.
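The nine-column GTF layout used by genes/chrX.gtf can be illustrated with a minimal Python sketch. The record below is hypothetical (it is not copied from chrX.gtf), and the function name `parse_gtf_line` is introduced here only for illustration. The attribute parsing assumes that no attribute value contains a semicolon.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # The ninth column holds 'key "value";' pairs.
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical record for illustration only.
example = 'chrX\texample\texon\t100\t250\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";'
record = parse_gtf_line(example)
feature_length = record["end"] - record["start"] + 1
print(record["attributes"]["gene_id"], feature_length)
```

Because both coordinates are inclusive, the feature length is `end - start + 1`.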

Usage

This dataset can be used to perform RNA-seq analysis using a variety of tools, such as HISAT2, StringTie, and Ballgown.

Here are some examples of how this dataset can be used:

To identify differentially expressed genes between two groups of samples.

To build a gene expression atlas for a particular tissue or cell type.

To study the expression of genes involved in a particular disease.

source: ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
LifeDB
scicrunch.org
dknet.org
+2 more
Updated Jan 31, 2026
+ more versions
Cite
(2026). LifeDB [Dataset]. http://identifiers.org/RRID:SCR_006899
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006899
Dataset updated
Jan 31, 2026
Description
Database that integrates large-scale functional genomics assays and manual cDNA annotation with bioinformatics gene expression and protein analysis. LifeDB integrates data on full-length cDNA clones with data on the expression of the encoded proteins and their subcellular localization in mammalian cell lines. It lets the scientific community systematically search for and select genes, proteins, and cDNAs of interest by specific database identifiers or by gene name, and visualize cDNA clones and the subcellular location of proteins. It also links results to external biological databases to provide broader functional information. LifeDB further provides an annotation pipeline that improves the mapping of clones to known human reference transcripts from the RefSeq and Ensembl databases.

An advanced web interface presents the data in a more user-friendly manner. Both Search genes and cDNA clones and Search sub-cellular locations of human proteins offer the following search options: By Keyword, By gene/transcript identifier, By plate name, By clone name, By cellular location.

* The Search genes and cDNA clones results include: Gene Name, Ensembl ID, Genomic Region, Clone name, Plate name, Plate position, Classification class, Synonymous SNPs, Non-synonymous SNPs, Number of ambiguous positions, and Alignment with reference genes.
* The Search sub-cellular locations of human proteins results include: Subcellular location, Gene Name, Ensembl ID, Clone name, True localization, Images, Start tag and End tag.

Every result page has an option to download result data (excluding the microscopy images). Clicking the 'Download results as CSV-file' link on the result page lets the user open or save the result data as a CSV (Comma Separated Values) file, which can then be opened using Excel or OpenOffice.
Genomes To Fields 2016
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Dec 18, 2023
Cite
Darwin Campbel; Natalia deLeon; Jode Edwards; Jack Gardiner; Naser Al Khalifah; Carolyn J. Lawrence-Dill; Jane Petzoldt; Cinta Romay; Renee Walton; Genomes to Fields Cooperators (2023). Genomes To Fields 2016 [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Genomes_To_Fields_2016/24852822
Explore at:
Available download formats: bin
Dataset updated
Dec 18, 2023
Dataset provided by
CyVerse Data Commons
Authors
Darwin Campbel; Natalia deLeon; Jode Edwards; Jack Gardiner; Naser Al Khalifah; Carolyn J. Lawrence-Dill; Jane Petzoldt; Cinta Romay; Renee Walton; Genomes to Fields Cooperators
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
Phenotypic, genotypic, and environment data for the 2016 field season: The data is stored in CyVerse. Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental (soil, weather) data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.

Resources in this dataset:

Resource Title: CyVerse Genomes To Fields 2016 dataset download.

File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/GenomesToFields_G2F_2016_Data_Mar_2018
Dataset (csv) and metadata (BibTex, Endnote) data downloads. See _readme.txt for file contents.
Drosophila Melanogaster Genome
kaggle.com
ieee-dataport.org
zip
Updated Nov 17, 2019
Cite
Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
Explore at:
Available download formats: zip (136202106 bytes)
Dataset updated
Nov 17, 2019
Authors
Myles O'Neill
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Drosophila Melanogaster

Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. They are not to be confused with Tephritidae flies (also known as fruit flies).

https://en.wikipedia.org/wiki/Drosophila_melanogaster

About the Genome

This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

![D. melanogaster chromosomes][1]

The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

Bioinformatics

Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations to each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

Learning Bioinformatics

There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

Files in this Dataset

Drosophila Melanogaster Genome

genome.fa

The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
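
Since repeats are written in lower case, the fraction of each chromosome that is repeat-masked can be estimated by counting lower-case bases. The following is a minimal sketch using only the standard library; the toy sequence and the helper names `read_fasta` and `masked_fraction` are illustrative assumptions, not part of the dataset. To run it on the real file, open genome.fa instead of the in-memory text.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(sequence):
    """Fraction of bases written in lower case (soft-masked repeats)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# Toy FASTA text standing in for genome.fa.
toy = io.StringIO(">chrToy\nACGTacgtAC\nGTacGTACGT\n")
fractions = {name: masked_fraction(seq) for name, seq in read_fasta(toy)}
print(fractions)
```

Reading the whole chromosome into memory is fine for a genome of this size; a streaming counter would be a better fit for much larger assemblies.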
Meta Information

There are 3 additional files with meta information about the genome.

meta-cpg-island-ext-unmasked.csv

This file contains descriptive information about CpG Islands in the genome.

https://en.wikipedia.org/wiki/CpG_site

meta-cytoband.csv

This file describes the positions of cytogenetic bands on each chromosome.

https://en.wikipedia.org/wiki/Cytogenetics

meta-simple-repeat.csv

This file describes simple tandem repeats in the genome.

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat
Drosophila Melanogaster mRNA Sequences

Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, organism mRNA information is known as a Transcriptome. mRNA files included in this dataset give insight into the activity of genes in the organism.

https://en.wikipedia.org/wiki/Messenger_RNA

mrna-genbank.fa

This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/genbank/

mrna-refseq.fa

This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/refseq/
Gene Predictions

A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...
Data from: Cell-specific gene-expression profiles and cortical thickness in...
figshare.com
txt
Updated Oct 29, 2019
Cite
Jean Shin; Leon French (2019). Cell-specific gene-expression profiles and cortical thickness in the human brain [Dataset]. http://doi.org/10.6084/m9.figshare.4752955.v3
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.4752955.v3
Dataset updated
Oct 29, 2019
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jean Shin; Leon French
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Neurobiological underpinnings of variations in cortical structure, such as cortical thickness, in the human brain are largely unknown. In this data report, we describe a method to evaluate the contribution of nine neural cell-types in explaining inter-regional variations in cortical structure measured with FreeSurfer in participants from a study cohort, by correlating the cortical-structure data and gene-expression data from the Allen Human Brain Atlas, summarized into the FreeSurfer cortical regions and validated using the BrainSpan Atlas. The resulting graphical display and re-sampling based association test results can give new insights into the underlying neurobiological mechanisms of cortical structure variation.

Files available (a detailed description of each file is included in 'README.doc'):

AllenHBA_DK_ExpressionMatrix.tsv - Tab separated file containing correlation to the median values across the donors (left hemisphere only, column named "Average donor correlation to median") and gene expression values across the 68 FreeSurfer cortical regions (columns).

DKRegionStatistics.tsv - Tab separated file characterizing the FreeSurfer cortical regions. This file lists how many donors contribute to each region, Allen Brain Atlas samples per region and alternative identifiers.

BrainSpanToHBARegionMapping.csv - Comma separated file characterizing the 11 cortical regions that are commonly available for the Allen Brain and Brainspan Atlases.

Reference_Consistent_Genes_ObtainedBy2StageFiltering.tsv - Tab separated file containing correlation coefficients ("correlation"), the corresponding 1-sided p-value ("pvalue") for gene-expression levels between Allen and Brainspan Atlases for the 11 cortical regions that are available for both atlases, and cell-type ("CellType") assignments based on Zeisel et al. (2015) for the 2,511 consistent genes obtained by applying a 2-stage filtering procedure.

Profile_males_corcoef_thickness_age.csv - Comma separated file containing an example phenotype-profile (users need to format their own cohort-specific file like this example and provide it).

GetBrainSpanCorrelations.R - R script to calculate and test correlation for gene expression profiles from Allen Human Brain vs. Brainspan Atlases.

LoadBrainSpanExpressionFunctions.R - R script defining a function to load expression data 'Exon microarray summarized to genes' available from Brainspan Atlas download website (http://brainspan.org/static/download.html).

GetGOGroupsFilteredHumanLists.R - R script to run GO analysis, where each Zeisel cell type panel is used as a foreground gene set, and the background set is the reference set with the 2,511 genes that pass the 2-stage filtering procedure.

GetExpressionPhenotypeCorrelations.R - R script to calculate expression-phenotype correlations and produce a 3-row-by-3-column graphical display of cell-type-specific empirical distributions of the resulting correlation coefficients with corresponding (1 - significance level) × 100% critical values, which are unadjusted for multiple testing.
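
The expression-phenotype correlation at the core of this method can be illustrated with a small Python sketch. This is not the authors' R implementation and it omits their re-sampling test; it only shows the per-gene step of correlating expression across cortical regions with a phenotype profile across the same regions. All numbers, gene names, and the `pearson` helper are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical phenotype profile: one value per cortical region.
profile = [1.0, 2.0, 3.0, 4.0]

# Hypothetical expression values for two genes over the same four regions.
expression = {
    "GENE_UP": [2.0, 4.0, 6.0, 8.0],
    "GENE_DOWN": [8.0, 6.0, 4.0, 2.0],
}

correlations = {gene: pearson(values, profile) for gene, values in expression.items()}
print(correlations)
```

In the method described above, such per-gene coefficients are then grouped by cell type and compared against an empirical distribution; that step is left to the provided R scripts.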
Data from: Distinct evolutionary trajectories in the Escherichia coli...
microbiology.figshare.com
xlsx
Updated Jun 1, 2023
Cite
Elizabeth Cummins; Rebecca Hall; Christopher H. Connor; James McInerney; Alan McNally (2023). Distinct evolutionary trajectories in the Escherichia coli pangenome occur within sequence types [Dataset]. http://doi.org/10.6084/m9.figshare.21360108.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.21360108.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Microbiology Society
Authors
Elizabeth Cummins; Rebecca Hall; Christopher H. Connor; James McInerney; Alan McNally
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Supplementary Material for 'Distinct evolutionary trajectories in the Escherichia coli pangenome occur within sequence types' as published in Microbial Genomics.

pangenome_data.zip contains (for each ST):

ST_updated_download.txt - the download list from Enterobase including accession numbers, Bio Project IDs and sample IDs.

ST_gene_presence_absence_roary.csv - gene presence/absence matrix output by Panaroo.

ST_pan_genome_reference.fa - linear reference genome output by Panaroo.
Data from: MarFERReT: an open-source, version-controlled reference library...
data.niaid.nih.gov
zenodo.org
Updated Jan 22, 2025
Cite
Groussman, Mora J; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911
Explore at:
Dataset updated
Jan 22, 2025
Dataset provided by
University of Washington
Authors
Groussman, Mora J; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT code repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository.

This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh.

The following MarFERReT data products are available in this repository:

MarFERReT.v1.1.1.metadata.csv

This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier.

accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

tax_id: The NCBI Taxonomy ID (taxID).

pr2_accession: Best-matching PR2 accession ID associated with entry

pr2_rank: The lowest shared rank between the entry and the pr2_accession

pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

source_link: URL where the original sequence data and/or metadata was collected.

pub_year: Year of data release or publication of linked reference.

ref_link: Pubmed URL directs to the published reference for entry, if available.

ref_doi: DOI of entry data from source, if available.

source_filename: Name of the original sequence file name from the data source.

seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

n_seqs_raw: Number of sequences in the original sequence file.

source_name: Full organism name from entry source

original_taxID: Original NCBI taxID from entry data source metadata, if available

alias: Additional identifiers for the entry, if available

MarFERReT.v1.1.1.curation.csv

This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

entry_id: Unique MarFERReT sequence entry identifier

marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

tax_id: Verified NCBI taxID used in MarFERReT

taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

taxID_notes: Notes on the original_taxID

n_seqs_raw: Number of sequences in the original sequence file

n_pfams: Number of Pfam domains identified in protein sequences

qc_flag: Early validation quality control flags for the following: LOW_SEQS; less than 1,200 raw sequences; LOW_PFAMS; less than 500 Pfam domain annotations.

flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

flag_VanVlierberghe: Flag for a high level of estimated contamination, from 'VV_contam_pct' values over 50%: FLAG_VV.

rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

flag_sum: Count of the number of flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63). All entries with one or more flag are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

accepted: Acceptance into the final MarFERReT build (Y or N).
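
The flagging and acceptance rule described in the fields above can be sketched in Python as follows. This is an illustrative reading of the field descriptions, not the MarFERReT build code: the function name, the argument names, and the `;` separator used when both quality-control flags apply are assumptions, and the Van Vlierberghe threshold is taken to apply to the contamination percentage.

```python
def curation_flags(n_seqs_raw, n_pfams, lasek_flagged,
                   vv_contam_pct=None, rp63_contam_pct=None):
    """Derive flag columns, flag_sum, and accepted for one candidate entry."""
    qc = []
    if n_seqs_raw < 1200:   # LOW_SEQS: less than 1,200 raw sequences
        qc.append("LOW_SEQS")
    if n_pfams < 500:       # LOW_PFAMS: less than 500 Pfam domain annotations
        qc.append("LOW_PFAMS")

    flags = {
        "qc_flag": ";".join(qc),  # separator is an assumption
        "flag_Lasek": "FLAG_LASEK" if lasek_flagged else "",
        "flag_VanVlierberghe": (
            "FLAG_VV" if vv_contam_pct is not None and vv_contam_pct > 50 else ""
        ),
        "flag_rp63": (
            "FLAG_RP63" if rp63_contam_pct is not None and rp63_contam_pct > 50 else ""
        ),
    }

    # flag_sum counts how many of the four flag columns are set.
    flag_sum = sum(1 for value in flags.values() if value)
    accepted = "Y" if flag_sum == 0 else "N"
    return {**flags, "flag_sum": flag_sum, "accepted": accepted}


# Hypothetical entries for illustration.
clean = curation_flags(25000, 9000, False, vv_contam_pct=3.0, rp63_contam_pct=1.5)
flagged = curation_flags(800, 300, False, vv_contam_pct=60.0)
print(clean["accepted"], flagged["accepted"], flagged["flag_sum"])
```

Missing contamination estimates are passed as `None` and never raise a flag, which matches the note that the Van Vlierberghe values are reported for MMETSP entries only.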

MarFERReT.v1.1.1.proteins.faa.gz

This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

MarFERReT.v1.1.1.taxonomies.tab.gz

This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for back-compatibility as input for the DIAMOND alignment software and LCA analysis.

The columns in this file contain the following information:

accession: (NA)

accession.version: The unique MarFERReT sequence identifier ('mftX').

taxid: The NCBI Taxonomy ID associated with this reference sequence.

gi: (NA).
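
The four-column layout above can be illustrated with a short Python sketch that writes mapping rows in this shape. The sequence identifiers and taxonomy IDs are hypothetical, the helper name is introduced for illustration, and the presence of a header row is an assumption about the file rather than something stated in the description.

```python
import csv
import io


def write_taxonomy_rows(pairs, handle):
    """Write (sequence_id, taxid) pairs as four tab-separated columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header row: assumed here, not confirmed by the dataset description.
    writer.writerow(["accession", "accession.version", "taxid", "gi"])
    for seq_id, taxid in pairs:
        # 'accession' and 'gi' carry no data and are filled with NA.
        writer.writerow(["NA", seq_id, taxid, "NA"])


# Hypothetical identifiers in the 'mft' plus ten digits form.
pairs = [("mft0000000001", 12345), ("mft0000000002", 67890)]
buffer = io.StringIO()
write_taxonomy_rows(pairs, buffer)
lines = buffer.getvalue().splitlines()
print(lines[1])
```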

MarFERReT.v1.1.1.proteins_info.tab.gz

This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

aa_id: the unique identifier for each MarFERReT protein sequence.

entry_id: The unique numeric identifier for each MarFERReT entry.

source_defline: The original, unformatted sequence identifier

MarFERReT.v1.1.1.best_pfam_annotations.csv.gz

This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries; derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

aa_id: The unique MarFERReT protein sequence ID ('mftX').

pfam_name: The shorthand Pfam protein family name.

pfam_id: The Pfam identifier.

pfam_eval: hmm profile match e-value score

pfam_score: hmm profile match bitscore

MarFERReT.v1.1.1.dmnd

This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. This can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.
Data from: Genomes To Fields (G2F) Inbred Ear Imaging Data 2017
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Genomes To Fields (G2F) Inbred Ear Imaging Data 2017 [Dataset]. https://catalog.data.gov/dataset/genomes-to-fields-g2f-inbred-ear-imaging-data-2017-079c0
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
A subset of ~30 inbreds were evaluated in 2014 and 2015 to develop an image based ear phenotyping tool. The data is stored in CyVerse. Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.

Resources in this dataset:

Resource Title: CyVerse Genomes To Fields Inbred Ear Imaging 2017 dataset download.

File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Edgar_Spalding_G2F_Inbred_Ear_Imaging_June_2017
Dataset (csv, tar.gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.
Genomes To Fields 2014
datasets.ai
agdatacommons.nal.usda.gov
+1 more
21
Updated Mar 30, 2024
+ more versions
Cite
Department of Agriculture (2024). Genomes To Fields 2014 [Dataset]. https://datasets.ai/datasets/genomes-to-fields-2014-d3326
Explore at:
21Available download formats
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Department of Agriculture
Description
Phenotypic, genotypic, and environment data for the 2014 field season: The data is stored in CyVerse.

Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.

Resources in this dataset:

Resource Title: CyVerse Genomes To Fields 2014 dataset download.

File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Carolyn_Lawrence_Dill_G2F_Nov_2016_V.3
Dataset (csv, h5, gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.
Gene Expression V2
kaggle.com
zip
Updated Sep 25, 2024
Cite
willian oliveira (2024). Gene Expression V2 [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/gene-expression-v2/suggestions
Explore at:
Available download formats: zip (18128 bytes)
Dataset updated
Sep 25, 2024
Authors
willian oliveira
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Gene Expression Omnibus (GEO) dataset GSE68086 provides crucial insights into cancer diagnostics by analyzing tumor-educated platelets (TEPs), offering a unique approach to non-invasive cancer detection across multiple cancer types. This dataset is centered on RNA-seq analysis, which focuses on the gene expression profiles of platelets from cancer patients. Tumor-educated platelets, which are altered by the presence of tumors, represent a promising biomarker for liquid biopsies, a method that allows for cancer detection without the need for invasive tissue sampling.

The dataset titled "RNA-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics" focuses on Homo sapiens and utilizes expression profiling by high-throughput sequencing. It includes 283 samples of blood platelets, of which 228 are tumor-educated platelets from patients with six types of malignant tumors: non-small cell lung cancer, colorectal cancer, pancreatic cancer, glioblastoma, breast cancer, and hepatobiliary carcinomas. The remaining 55 samples are from healthy individuals, serving as control samples.

The methodology for generating this dataset involved collecting blood samples using EDTA as an anticoagulant, isolating platelets, and extracting RNA using the mirVana RNA isolation kit. Following RNA extraction, cDNA synthesis and amplification were performed using the SMARTer Ultra Low RNA Kit, and sequencing was conducted using the Illumina HiSeq 2500 platform. Quality control was rigorously ensured by employing the Bioanalyzer 2100 system. Data processing steps involved the use of various bioinformatics tools, including Trimmomatic for quality control, STAR for mapping reads to the hg19 reference genome, Picard-tools for selecting intron-spanning reads, and HTseq for read summarization.

The dataset's structure includes 285 columns representing samples (both TEP and healthy controls) and 57,736 rows corresponding to Ensembl gene IDs. The primary data format is intron-spanning read counts, and files available for download include both gzipped text files (such as GSE68086_TEP_data_matrix.txt.gz) and CSV files for easy access and manipulation. Detailed sample information is provided in the series matrix files, both in text and CSV formats.
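As a minimal sketch of how a gzipped matrix of this shape could be read without third-party packages, the following assumes a tab-delimited layout with gene IDs in the first column and one column per sample. The delimiter, the header layout, and the helper name `read_count_matrix` are assumptions for illustration; the demo file is synthetic and only stands in for a file such as GSE68086_TEP_data_matrix.txt.gz.

```python
import csv
import gzip
import os
import tempfile


def read_count_matrix(path):
    """Read a gzipped, tab-delimited matrix (assumed layout):
    first column = gene ID, remaining columns = per-sample read counts."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        counts = {row[0]: [int(v) for v in row[1:]] for row in reader if row}
    return samples, counts


# Synthetic stand-in file; the real matrix is far larger.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_matrix.txt.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    handle.write("gene_id\tsample_A\tsample_B\n")
    handle.write("ENSG00000000003\t12\t0\n")
    handle.write("ENSG00000000005\t3\t7\n")

samples, counts = read_count_matrix(demo_path)
print(samples, len(counts))
```

If the real file omits the label in the first header cell, the `header[1:]` slice would need adjusting.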

This dataset has several potential applications. It can be used to explore liquid biopsy techniques for non-invasive cancer diagnostics, identify cancer-specific biomarkers, and study cancer-induced changes in platelet RNA profiles. Researchers can perform comparative analyses across different cancer types and apply machine learning models for both binary classification (distinguishing between healthy individuals and cancer patients) and multiclass classification (differentiating between various cancer types). Molecular pathway analysis could also be employed to identify pathways specific to different cancers.

The importance of this dataset lies in its potential to significantly advance cancer diagnostics by leveraging TEPs as biomarkers. This approach could enable early detection and more precise classification of cancers, offering a novel method of blood-based screening using gene expression profiles. The data can be accessed through the GEO platform under accession number GSE68086, and online analysis tools such as GEO2R and the GEOquery R package facilitate further analysis. This research was published by Best MG et al. in the Cancer Cell journal in 2015, where it was recognized for demonstrating the efficacy of tumor-educated platelets in pan-cancer diagnostics.
Data from: Expressed Sequence Tags from the Ciliate Protozoan Parasite...
agdatacommons.nal.usda.gov
catalog.data.gov
xls
Updated Nov 22, 2025
Cite
Jason W. Abernathy; Peng Xu; Ping Li; De-Hai Xu; Huseyin Kucuktas; Phillip Klesius; Covadonga Arias; Zhanjiang Liu (2025). Expressed Sequence Tags from the Ciliate Protozoan Parasite Ichthyophthirius Multifiliis [Dataset]. http://doi.org/10.15482/USDA.ADC/1529231
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.15482/USDA.ADC/1529231
Dataset updated
Nov 22, 2025
Dataset provided by
BMC Genomics
Authors
Jason W. Abernathy; Peng Xu; Ping Li; De-Hai Xu; Huseyin Kucuktas; Phillip Klesius; Covadonga Arias; Zhanjiang Liu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Researchers sequenced 10,368 expressed sequence tag (EST) clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences containing 976 contigs and 3,730 singletons. The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease' leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate ESTs for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs.

Resources in this dataset:

Resource Title: Data Dictionary - Supplemental Tables 1, 2, and 3.

File Name: IchthyophthiriusESTs_DataDictionary.csv
Resource Description: Machine-readable comma-separated values (CSV) definitions for data elements of Supplemental Tables 1-3 concerning I. multifiliis unique EST sequences, BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes, and gene ontology (GO) profile.

Resource Title: Table 3. Table of gene ontology (GO) profiles.

File Name: 12864_2006_889_MOESM3_ESM.xls
Resource Description: Supplemental Table 3, Excel spreadsheet; Table of gene ontology (GO) profiles; Provided information includes unique EST name, accession numbers, BLASTX top hit, GO identification numbers and enzyme commission (EC) numbers.

Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176

Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM3_ESM.xls

Resource Title: Table I. Multifiliis unique EST sequences.

File Name: 12864_2006_889_MOESM1_ESM.xls
Resource Description: Supplemental Table 1 for article, "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; Table of I. multifiliis unique EST sequences; Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name and accession numbers. Also included are significant protein domain comparisons to the Swiss-Prot database. Putative secretory proteins are highlighted.

Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176

Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM1_ESM.xls

Resource Title: Table 2. Excel spreadsheet; Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes.

File Name: 12864_2006_889_MOESM2_ESM.xls
Resource Description: Table 2 from "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name, tBLASTx top hits to the T. thermophila genome, and BLASTX top hits to the P. falciparum genome sequences. This table correlates with the Venn diagram in figure 1.

Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176

Direct download link for this data resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM2_ESM.xls
Table_5_An integrative analysis of single-cell and bulk transcriptome and...
frontiersin.figshare.com
txt
Updated Dec 21, 2023
+ more versions
Cite
Hong-Kai Cui; Chao-Jie Tang; Yu Gao; Zi-Ang Li; Jian Zhang; Yong-Dong Li (2023). Table_5_An integrative analysis of single-cell and bulk transcriptome and bidirectional mendelian randomization analysis identified C1Q as a novel stimulated risk gene for Atherosclerosis.csv [Dataset]. http://doi.org/10.3389/fimmu.2023.1289223.s003
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fimmu.2023.1289223.s003
Dataset updated
Dec 21, 2023
Dataset provided by
Frontiers Media http://www.frontiersin.org/
Authors
Hong-Kai Cui; Chao-Jie Tang; Yu Gao; Zi-Ang Li; Jian Zhang; Yong-Dong Li
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background

The role of complement component 1q (C1Q)-related genes in human atherosclerotic plaques (HAP) is not well understood. Our aim is to establish C1Q-associated hub genes using single-cell RNA sequencing (scRNA-seq) and bulk RNA analysis to diagnose and predict HAP patients more effectively, and to investigate the association between C1Q and HAP (ischemic stroke) using bidirectional Mendelian randomization (MR) analysis.

Methods

HAP scRNA-seq and bulk-RNA data were downloaded from the Gene Expression Omnibus (GEO) database. The C1Q-related hub genes were screened using the GBM, LASSO and XGBoost algorithms. We built machine learning models to diagnose and distinguish between types of atherosclerosis using generalized linear models and receiver operating characteristic (ROC) analyses. Further, we scored the HALLMARK_COMPLEMENT signaling pathway using ssGSEA and confirmed hub gene expression through qRT-PCR in RAW264.7 macrophages and apoE-/- mice. Furthermore, the risk association between C1Q and HAP was assessed through bidirectional MR analysis, with C1Q as the exposure and ischemic stroke (IS, large artery atherosclerosis) as the outcome. Inverse variance weighting (IVW) was used as the main method.

Results

We used the scRNA-seq dataset (GSE159677) to identify 24 cell clusters and 12 cell types, and revealed seven C1Q-associated DEGs in both the scRNA-seq and GEO datasets. We then used GBM, LASSO and XGBoost to select C1QA and C1QC from the seven DEGs. Our findings indicated that both training and validation cohorts had satisfactory diagnostic accuracy for identifying patients with HAP. Additionally, we confirmed SPI1 as a potential TF responsible for regulating the two hub genes in HAP. Our analysis further revealed that the HALLMARK_COMPLEMENT signaling pathway was activated and correlated with C1QA and C1QC. We confirmed high expression levels of C1QA, C1QC and SPI1 in ox-LDL-treated RAW264.7 macrophages and apoE-/- mice using qPCR. The results of MR indicated that there was a positive association between the genetic risk of C1Q and IS, as evidenced by an odds ratio (OR) of 1.118 (95%CI: 1.013–1.234, P = 0.027).

Conclusion

The authors have developed and validated a novel two-gene diagnostic signature for HAP, while MR analysis has provided evidence supporting a positive association between C1Q and IS.
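The reported odds ratio, confidence interval and P value can be cross-checked against each other. The sketch below is a generic back-calculation, not the authors' MR code: it assumes a Wald-type interval that is symmetric on the log-odds scale with a critical value of 1.96, and the function name `p_from_or_ci` is hypothetical.

```python
import math


def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover SE, z and a two-sided p-value from an OR and its 95% CI,
    assuming the CI is symmetric on the log-odds scale."""
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


se, z, p_value = p_from_or_ci(1.118, 1.013, 1.234)
print(f"SE(log OR) = {se:.4f}, z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under these assumptions the recovered p-value comes out close to the reported P = 0.027, so the three reported numbers are mutually consistent.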
Data from: Quantifying the phenome-wide response to sex-specific selection...
search.dataone.org
datadryad.org
Updated Feb 26, 2025
Cite
Thomas Keaney; Luke Holman (2025). Quantifying the phenome-wide response to sex-specific selection in Drosophila melanogaster [Dataset]. http://doi.org/10.5061/dryad.2v6wwpzzp
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.2v6wwpzzp
Dataset updated
Feb 26, 2025
Dataset provided by
Dryad Digital Repository
Authors
Thomas Keaney; Luke Holman
Description
In species with separate sexes, selection on males causes evolutionary change in female trait values (and vice versa) via genetic correlations, which has far-reaching consequences for adaptation. Here, we utilise a sex-specific form of Robertson's Secondary Theorem of Natural Selection to estimate the expected response to selection for 474 organismal-level traits and ~28,000 gene expression traits measured in the Drosophila Genetic Reference Panel (DGRP). Across organismal-level traits, selection acting on males produced a larger predicted evolutionary response than did selection acting on females, even for female traits; while for transcriptome traits selection on each sex produced a roughly equal average evolutionary response. For most traits, selection on males and females was predicted to move average trait values in the same direction, though for some traits, selection on one sex increased trait values while selection on the other sex decreased them, implying intralocus sexual con...

We performed a forward citation search of Google Scholar for articles that had cited the original DGRP paper (Mackay et al., 2012; this study introduced the resource and is the common citation among those that use it) as of January 2022, to obtain line mean estimates and associated meta-data for quantitative traits that have been measured in the DGRP (Figure 1). We supplemented our search by including all articles cited in an influential review of the DGRP (Mackay & Huang, 2018) and all datasets included on the DGRP2 web application (http://dgrp2.gnets.ncsu.edu/; Mackay et al., 2012; Huang et al., 2014). In total, we identified 126 studies that reported line means or raw data for 38,411 phenotypic traits. 19,259 of these were female traits and 19,081 male traits (the remaining 71 were estimated from mixed sex groups). 36,280 of these were reported by a single study, which measured whole-body expression level for many of the known genes in D. melanogaster (Huang et al., 2015). We res...

# Data from: Quantifying the phenome-wide response to sex-specific selection in Drosophila melanogaster

https://doi.org/10.5061/dryad.2v6wwpzzp

Description of the data and file structure

Data concern a vast range of quantitative traits measured across the Drosophila genetic reference panel. With these data you can run all analyses presented in the associated manuscript.

If you would like to use the dataset for your own analysis, download meta_data_for_all_traits.csv and either all.dgrp.phenos_unscaled.csv or all.dgrp.phenos_scaled.csv depending on whether you want to use traits expressed on their original scale or on a standardised scale (mean = 0, SD = 1). Read this to see how we collated the dataset and conducted quality control. Read this to see how we further clea...
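For readers who start from the unscaled file, a standardised scale (mean = 0, SD = 1) can be reproduced per trait along the following lines. This is only an illustrative sketch: the function name `standardise`, the use of the sample standard deviation (n - 1 denominator), and the treatment of missing line means as `None` are assumptions, and the authors' own scaling may differ in these details.

```python
import statistics


def standardise(values):
    """Scale one trait's line means to mean 0 and SD 1, leaving missing values (None) in place."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)  # sample SD; an assumption, not confirmed by the source
    return [None if v is None else (v - mean) / sd for v in values]


# Toy line means for a single hypothetical trait
line_means = [2.0, 4.0, None, 6.0, 8.0]
scaled = standardise(line_means)
print(scaled)
```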
Escherichia coli metadata files obtained from isolates listed in Enterobase...
figshare.com
txt
Updated Jul 1, 2024
Cite
Albert Schriefer (2024). Escherichia coli metadata files obtained from isolates listed in Enterobase and PATRIC databases. [Dataset]. http://doi.org/10.6084/m9.figshare.25796323.v3
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.25796323.v3
Dataset updated
Jul 1, 2024
Dataset provided by
figshare
Figshare http://figshare.com/
Authors
Albert Schriefer
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SampleMiner (SM) is a Python script that helps researchers dependent on limited internet resources to locally perform microbial isolates sampling, metadata retrieval and whole-genome sequences (WGS) download, provided the metadata pages of the online database can be downloaded as comma-separated value (csv) files. The metadata tab-separated value (tsv) files listed in this Figshare item were used during the development of SM code. These files contain all E. coli isolates metadata obtained from Enterobase on 18/11/2021 and from PATRIC on 29/9/2021, using Escherichia or Shigella as combined keywords. The originally downloaded csv files were converted into the tsv files.
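The csv-to-tsv conversion mentioned above can be done with Python's standard csv module. The sketch below is not taken from SampleMiner; the function name `csv_to_tsv` and the example rows are hypothetical, and it simply shows one way to convert while keeping commas inside quoted fields intact.

```python
import csv
import io


def csv_to_tsv(csv_handle, tsv_handle):
    """Copy rows from a comma-separated source to a tab-separated target; returns the row count."""
    reader = csv.reader(csv_handle)
    writer = csv.writer(tsv_handle, delimiter="\t", lineterminator="\n")
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# In-memory example with a quoted field that contains a comma
source = io.StringIO('isolate,species,source\nEC001,"Escherichia coli","urine, human"\n')
target = io.StringIO()
n_rows = csv_to_tsv(source, target)
print(target.getvalue())
```

Parsing with a csv reader rather than replacing commas directly avoids splitting fields such as "urine, human" into two columns.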
Relevant datasets for sgRNA library characterization tasks
data.niaid.nih.gov
resodate.org
Updated Aug 27, 2023
Cite
Vivek Das (2023). Relevant datasets for sgRNA library characterization tasks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8285482
Explore at:
Dataset updated
Aug 27, 2023
Dataset provided by
Novo Nordisk A/S
Authors
Vivek Das
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This contains all the relevant datasets that were downloaded, used and generated to execute the sgRNA library characterization in the notebook sgRNA_Ryvu.ipynb.

This includes the following files:

The hg38 fasta file in chunks

The necessary indexes built with Bowtie 2

The relevant mapped sam and bam files that are indexed and sorted

The hg38 gtf file for gene annotation

Relevant GDCquery download file for BRCA

Final gene expression matrix file in CSV format.
nf-core/metatdenovo taxonomy
researchdata.se
Updated Feb 24, 2025
+ more versions
Cite
Daniel Lundin (2025). nf-core/metatdenovo taxonomy [Dataset]. http://doi.org/10.17044/SCILIFELAB.28211678
Explore at:
Unique identifier
https://doi.org/10.17044/SCILIFELAB.28211678
Dataset updated
Feb 24, 2025
Dataset provided by
Linnaeus University
Authors
Daniel Lundin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data in this repository can be used to assign taxonomy to sequences with Diamond [Buchfink et al. 2015], particularly using the --diamond_dbs parameter in nf-core/metatdenovo (https://nf-co.re/metatdenovo), release 1.1 or later. Currently, the data available represents species-representative genomes from the Genome Taxonomy Database (GTDB), release R09-RS220 [Parks et al. 2018].

File preparation

All species-representative genomes from GTDB were downloaded from the National Center for Biotechnology Information (NCBI) and annotated with Prokka [v. 1.14.6; Seemann 2014], and the sequences for all resulting proteins were used for this data. The taxonomy dump files (in NCBI taxonomy dump format) were created from the GTDB metadata with TaxonKit [v. 0.18.0; Shen and Ren 2021] and the Diamond database with Diamond [v. 2.1.10; Buchfink et al. 2015] in "taxonomy mode", i.e. using the taxonomy dump created with TaxonKit. (See below for commands used.)

File descriptions

There are five files:

gtdb-r220.faa.gz: Fasta file with protein sequences. Not used by nf-core/metatdenovo but can be used to create the Diamond database below.

gtdb-r220.taxonomy.dmnd: Diamond database with taxonomy information.

gtdb-r220.names.dmp: Taxonomy dump file.

gtdb-r220.nodes.dmp: Nodes dump file.

gtdb-r220.seqid2taxid.tsv.gz: Mapping from protein accession to taxon.

The Diamond database and taxonomy dump files can be used with nf-core/metatdenovo (Version >1.1) by providing a csv file like below to the --diamond_dbs parameter. (Although Nextflow can use https-urls for paths, it is usually better to download the very large files and keep local copies.)

db,dmnd_path,taxdump_names,taxdump_nodes,ranks,parse_with_taxdump

gtdb,gtdb_r220_repr.dmnd,gtdb_taxdump/names.dmp,gtdb_taxdump/nodes.dmp,domain;phylum;class;order;genus;species;strain,
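If several databases need to be listed, the csv file for --diamond_dbs can also be generated rather than typed by hand. The sketch below reuses the header and example row shown above; the helper name `write_diamond_dbs` is hypothetical, and leaving parse_with_taxdump empty simply mirrors the example rather than documenting what the pipeline expects there.

```python
import csv
import io

# Column names copied from the example header above
FIELDS = ["db", "dmnd_path", "taxdump_names", "taxdump_nodes", "ranks", "parse_with_taxdump"]


def write_diamond_dbs(handle, entries):
    """Write one csv row per database entry; fields missing from an entry are left empty."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow({field: entry.get(field, "") for field in FIELDS})


entry = {
    "db": "gtdb",
    "dmnd_path": "gtdb_r220_repr.dmnd",
    "taxdump_names": "gtdb_taxdump/names.dmp",
    "taxdump_nodes": "gtdb_taxdump/nodes.dmp",
    "ranks": "domain;phylum;class;order;genus;species;strain",
}
buffer = io.StringIO()
write_diamond_dbs(buffer, [entry])
print(buffer.getvalue())
```

The paths in the entry are the ones from the example row; in practice they would point at the local copies of the downloaded files.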

Commands used to prepare taxonomy dump files and the Diamond database

- Taxonomy dump: cut -f 1,19-20 metadata.tsv | grep -v 'accession' | awk 'BEGIN { FS="\t" } { if ( $2 == "t" ) { print $1 "\t" $3 } }' | taxonkit create-taxdump --gtdb -O .

- Diamond database: gunzip -c gtdb-r220.faa.gz | sed '/^>/s/ .//' | diamond makedb --taxonmap gtdb-r220.seqid2taxid.tsv.gz --taxonnames gtdb-r220.names.dmp --taxonnodes gtdb-r220.nodes.dmp --db gtdb-r220.taxonomy.dmnd --no-parse-seqids

Revision history

20250211 First version
Pan-cancer Aberrant Pathway Activity Analysis (PAPAA)
zenodo.org
application/gzip, csv +1
Updated Dec 5, 2020
Cite
DANIEL BLANKENBERG; VIJAY NAGAMPALLI (2020). Pan-cancer Aberrant Pathway Activity Analysis (PAPAA) [Dataset]. http://doi.org/10.5281/zenodo.3625201
Explore at:
Available download formats: tsv, application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.3625201
Dataset updated
Dec 5, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
DANIEL BLANKENBERG; VIJAY NAGAMPALLI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Information about the dataset files:

1) pancan_rnaseq_freeze.tsv.gz: Publicly available gene expression data for the TCGA Pan-cancer dataset. File: PanCanAtlas EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611] [https://doi.org/10.1016/j.celrep.2018.03.046]

2) pancan_mutation_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset. File: mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]

3) pancan_GISTIC_threshold.tsv.gz: Publicly available gene-level copy number information of the TCGA Pan-cancer dataset. This file is processed using script process_copynumber.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. The files copy_number_loss_status.tsv.gz and copy_number_gain_status.tsv.gz generated from this data are used as inputs in our Galaxy pipeline. [https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443] [https://doi.org/10.1016/j.celrep.2018.03.046]

4) mutation_burden_freeze.tsv.gz: Publicly available Mutational information for TCGA Pan-cancer dataset mc3.v0.2.8.PUBLIC.maf.gz was processed using script process_sample_freeze.py by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/][http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc] [https://doi.org/10.1016/j.celrep.2018.03.046]

5) sample_freeze.tsv or sample_freeze_version4_modify.tsv: The file lists the frozen samples as determined by TCGA PanCancer Atlas consortium along with raw RNAseq and mutation data. These were previously determined and included for all downstream analysis. All other datasets were processed and subset according to the frozen samples. [https://github.com/greenelab/pancancer/]

6) vogelstein_cancergenes.tsv: compendium of OG and TSG used for the analysis. [https://github.com/greenelab/pancancer/]

7) CCLE_DepMap_18Q1_maf_20180207.txt.gz Publicly available Mutational data for CCLE cell lines from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2FCCLE_DepMap_18Q1_maf_20180207.txt]

8) ccle_rnaseq_genes_rpkm_20180929.gct.gz: Publicly available Expression data for 1019 cell lines (RPKM) from Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://depmap.org/portal/download/api/download/external?file_name=ccle%2Fccle_2019%2FCCLE_RNAseq_genes_rpkm_20180929.gct.gz]

9) CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct: Publicly available merged Mutational and copy number alterations that include gene amplifications and deletions for the CCLE cell lines. This data is represented in the binary format and provided by the Broad Institute Cancer Cell Line Encyclopedia (CCLE) / DepMap Portal. [https://data.broadinstitute.org/ccle_legacy_data/binary_calls_for_copy_number_and_mutation_data/CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct]

10) GDSC_cell_lines_EXP_CCLE_names.csv.gz Publicly available RMA normalized expression data for Genomics of Drug Sensitivity in Cancer(GDSC) cell-lines. File gdsc_cell_line_RMA_proc_basalExp.csv was downloaded. This data was subsetted to 389 cell lines that are common among CCLE and GDSC. All the GDSC cell line names were replaced with CCLE cell line names for further processing. [https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip]

11) GDSC_CCLE_common_mut_cnv_binary.csv.gz: A subset of merged Mutational and copy number alterations that include gene amplifications and deletions for common cell lines between GDSC and CCLE. This file is generated using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct and a list of common cell lines.

12) gdsc1_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC1 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_15Oct19.xlsx]

13) gdsc2_ccle_pharm_fitted_dose_data.txt.gz: Pharmacological data for GDSC2 cell lines. [ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC2_fitted_dose_response_15Oct19.xlsx]

14) compounds.csv: list of pharmacological compounds tested for our analysis

15) tcga_dictonary.tsv: list of cancer types used in the analysis.

16) seg_based_scores.tsv: Measurement of total copy number burden, Percent of genome altered by copy number alterations. This file was used as part of the Pancancer analysis by Gregory Way et al as described in https://github.com/greenelab/pancancer/ data processing and initialization steps. [https://github.com/greenelab/pancancer/]
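Items 10 and 11 in the list above describe subsetting GDSC data to the cell lines shared with CCLE and replacing GDSC names with CCLE names. A minimal sketch of that kind of operation follows; it is not the PAPAA code, and the function names, the name mapping and the toy expression values are all hypothetical.

```python
def common_lines(ccle_names, gdsc_to_ccle):
    """Keep only GDSC -> CCLE name pairs whose CCLE name is present in the CCLE list."""
    ccle = set(ccle_names)
    return {gdsc: name for gdsc, name in gdsc_to_ccle.items() if name in ccle}


def subset_and_rename(expression_by_line, mapping):
    """Drop cell lines outside the mapping and re-key the rest by their CCLE name."""
    return {
        mapping[line]: values
        for line, values in expression_by_line.items()
        if line in mapping
    }


# Toy inputs (illustrative only)
ccle_names = ["A549_LUNG", "MCF7_BREAST", "HELA_CERVIX"]
gdsc_to_ccle = {"A549": "A549_LUNG", "MCF7": "MCF7_BREAST", "XYZ-1": "XYZ1_SKIN"}
gdsc_expression = {"A549": [7.1, 3.2], "MCF7": [6.4, 2.9], "XYZ-1": [5.0, 4.4]}

mapping = common_lines(ccle_names, gdsc_to_ccle)
renamed = subset_and_rename(gdsc_expression, mapping)
print(renamed)
```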
Simulating population divergence of Northern chamois in the Alps based on...
cmr.earthdata.nasa.gov
envidat.ch
Updated Sep 27, 2023
Cite
(2023). Simulating population divergence of Northern chamois in the Alps based on habitat dynamics [Dataset]. http://doi.org/10.16904/envidat.291
Explore at:
Unique identifier
https://doi.org/10.16904/envidat.291
Dataset updated
Sep 27, 2023
Time period covered
Jan 1, 2022
Area covered

Description
General description

Genomic data, habitat suitability raster files and scripts to run gen3sis to simulate cumulative divergence over time as an approximation for genetic differentiation. Scripts for basic analysis of the simulations (e.g., create distance matrix from sampling locations) are provided, too. See original publication (doi link will be provided after publication) for details. The study area is the European Alps. All data are uploaded as zipped files. Unzip them after the download and put all data in one folder. See linked publications for correct citation of the data used; use of the data without correct citation is not allowed. Corresponding author: Flurin Leugger, email: flurin.leugger@gmail.com

# Description of the data (content of the different zip folders)

## Abiotic data

### Glaciers

Folders with raster stacks with glaciated areas at 0.05° resolution in WGS84 projection from Seguinot et al. (2018).

Seguinot, J., Ivy-Ochs, S., Jouvet, G., Huss, M., Funk, M., & Preusser, F. (2018). Modelling last glacial cycle ice dynamics in the Alps. The Cryosphere, 12(10), 3265–3285. https://doi.org/10.5194/tc-12-3265-2018

### Rivers

* river_raster_elevation_class.tif: raster file (.tif) at 0.05° resolution and WGS84 projection with large rivers (scenario 2 from publication). The rivers (each cell) are classified according to the elevation of the cell. Natural Earth. (2018). Rivers + lake centerlines version 4.1.0. Retrieved January 22, 2020, from https://www.naturalearthdata.com/downloads/50m-physical-vectors/50m-rivers-lake-centerlines
* river_raster_strahler_class_5km.tif: raster file at 0.05° resolution and WGS84 projection with medium rivers. The rivers are classified according to their Strahler order. Food and Agriculture Organization of the United Nations. (2014). Rivers in Europe (Derived from HydroSHEDS). Retrieved January 29, 2020, from http://www.fao.org/geonetwork/srv/fr/google.kml?uuid=e0243940-e5d9-487c-8102-45180cf1a99f&layers=AQUAMAPS:37253_rivers_europe

## Fossil records

* chamois_fossil_combined_public.xlsx: list with fossil records until 20,000 years BP from Central Europe, see linked references for citation.

## Chamois occurrences

* chamois_occurrence.csv: Chamois presences from all sources used for the publication (see Suppl. mat. Table S1 for detailed information and correct citations of the data) aggregated at 0.05° resolution (~5km).

## Gen3sis

* config: folders with all configuration files used to run the simulations for the publication (different dispersal divergence parameters).
* scripts: scripts (and helper functions) to run the gen3sis simulations including scripts for the beginning of the subsequent analysis.

## Genetic

* populations.snps.light.vcf: vcf file of the sampled Northern chamois (Rupicapra rupicapra). The genomic data encompasses 20k SNPs (from ddRAD sequencing).
* Sequencing_final_without_slovakia.txt: sampling locations of Northern chamois (Rupicapra rupicapra)

## HSM

* habitat_suitability_hindcasting: Aggregated habitat suitability raster files (stacks, .grd files) at 0.05° resolution and WGS84 projection from 20,000 years BP until today in 100 year time steps. There are separate folders for each environmental variable scenario used (different terrain slope variables) and the different occurrence/pseudo-absence sampling strategy used.
* ODMAP_LeuggerEtAl2021-10-25.csv_: ODMAP protocol
Salmonella enterica pangenome graph and variant call data for 539,283...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Jul 11, 2025
Cite
Agricultural Research Service (2025). Salmonella enterica pangenome graph and variant call data for 539,283 genomes [Dataset]. https://catalog.data.gov/dataset/isalmonella-enterica-ipangenome-graph-and-variant-call-data-for-539283-genomes
Explore at:
Dataset updated
Jul 11, 2025
Dataset provided by
Agricultural Research Service
Description
Salmonella pangenome graph and variant call data for 539,283 genomes

Description: Salmonella enterica causes human disease and decreases agricultural production. The overall goal of this project is to generate a large database of S. enterica variants, with 539,283 samples and 236,069 features, for applications in machine learning and genomics. We transformed single nucleotide polymorphism (SNP) data into reduced-dimensional representations that are tolerant of missing data, based on disentangled variational autoencoders. TFRecord files were made with custom Python scripts that parsed the variant call format (VCF) files into sparse tensors and combined them with the Salmonella In Silico Typing Resource (SISTR) serotype data.

The data directory contains:

The tar file of TFRecords: tfrecords.tar (103 GB). The TFRecords are organized first by how they were genotyped: the mpileup records were created with Mpileup, and the gvg records were created with graph variant calling. Each of these directories holds batches of ~10,000 sequence reads named Sra10k_XX.tfrecord.gz (00--54). File Sra10k_99.tfrecord.gz contains incomplete SRAs. Each TFRecord contains the shape of the tensor, the indices of non-zero variants, sample name, serotype, and sparse values. Value 99 was assigned to '.' records.

The file output.tar (11.4 TB) contains the .vcf files used to create the TFRecords above. The data in here is contained more succinctly in the TFRecord format. This data will not normally be used.

A tar file of metadata files for the samples, metadata (95 MB). Sequence read archive (SRA) accessions were downloaded using edirect/eutilities and saved as SraAccList.txt:

esearch -db sra -query "txid28901[Organism:exp] AND (cluster_public[prop] AND 'biomol dna'[Properties] AND 'library layout paired'[Properties] AND 'platform illumina'[Properties] AND 'strategy wgs'[Properties] OR 'strategy wga'[Properties] OR 'strategy wcs'[Properties] OR 'strategy clone'[Properties] OR 'strategy finishing'[Properties] OR 'strategy validation'[Properties])" | efetch -format runinfo -mode xml | xtract -pattern Row -element Run > SraAccList.txt

Google BigQuery was used to download metadata for the SRA accessions from the National Institutes of Health (NIH):

SELECT * FROM nih-sra-datastore.sra.metadata as metadata INNER JOIN {table_id} as leiacc ON metadata.acc = leiacc.accID;

Files were processed into batches of ~10,000 and named Sra_completed_XX.csv (00--53).

A VCF document mapping the TFRecord data to the positions in the graph subjected to the Type strain LT2: mapping/DRR452337.gvg.vcf-with_TFRecord_in_1st_column.txt

Scripts for creating and reading TFRecord data: code.

reading_and_parsing_fns.py - Defines functions for converting VCFs of variants called using gvg to sparse tensors and makes the TFRecord files.
gvg_to_tfrecord.py - Creates TFRecords from the sparse tensors.

Tutorial for using the TFRecords: Example_logistic_regression.md

Pangenome graph files and references used for variant calling and genotyping: pangenome.

refPlus100.fasta.gz - The genomes of the 101 Salmonella strains, without plasmids, used for construction of the pangenome graph.
salm.100.NC_003197_v2.d2_complete.gfa.gz - The complete 101 Salmonella strain pangenome graph in Graphical Fragment Assembly (GFA2) Format 2.0, including alt nodes used for genotyping.
salm.100.NC_003197_v2.full.gfa.gz - The full graph including alt nodes.
salm.100.NC_003197_v2.full.vcf.gz - A VCF of the file above.
genotyped.gvg.vcf - The genotype calls in VCF format.
paths.txt - The paths of the graph.

SCINet users: The data folder can be accessed/retrieved with a valid SCINet account at this location: /LTS/ADCdatastorage/NAL/published/node28083194/ See the SCINet File Transfer guide for more information on moving large files: https://scinet.usda.gov/guides/data/datatransfer

Globus users: The files can also be accessed through Globus by following this data link. The user will need to log in to Globus in order to access this data. User accounts are free of charge, with several options for signing on. Instructions for creating an account are on the login page.
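
The sparse layout described for each TFRecord (tensor shape, indices of non-zero variants, sparse values, with 99 standing in for '.' records) can be illustrated with a small sketch. This is not the dataset's own code from reading_and_parsing_fns.py or gvg_to_tfrecord.py: the function names, the dictionary keys, the sample name and the serotype value are hypothetical, and it assumes each per-site call is either '.' or an integer string in which 0 means "no variant". It uses only the Python standard library and does not read or write real TFRecord files.

```python
MISSING = 99  # value the dataset description assigns to '.' records


def encode_sparse(calls, missing=MISSING):
    """Convert a dense list of per-site calls into shape/indices/values.

    Assumption: each call is '.' (missing) or an integer string, and 0
    means "no variant", so zeros are simply not stored.
    """
    indices = []
    values = []
    for position, call in enumerate(calls):
        value = missing if call == "." else int(call)
        if value != 0:
            indices.append(position)
            values.append(value)
    return {"shape": [len(calls)], "indices": indices, "values": values}


def decode_sparse(record):
    """Rebuild the dense vector from the sparse representation."""
    dense = [0] * record["shape"][0]
    for position, value in zip(record["indices"], record["values"]):
        dense[position] = value
    return dense


# Hypothetical five-site sample; the name and serotype are placeholders.
calls = ["0", "1", ".", "0", "2"]
record = encode_sparse(calls)
record["sample"] = "SRR_EXAMPLE"
record["serotype"] = "Typhimurium"
print(record)
print(decode_sparse(record))
```

Storing only the non-zero positions keeps each record small when most of the 236,069 features are zero for a given sample, which is presumably why the VCFs in output.tar are so much larger than the TFRecords.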
SNP dataset for GWAS
kaggle.com
zip
Updated Feb 27, 2023
Cite
Piotr Szulc (2023). SNP dataset for GWAS [Dataset]. https://www.kaggle.com/datasets/seascape/snp-dataset-for-gwas
Explore at:
zip (143503906 bytes)
Dataset updated
Feb 27, 2023
Authors
Piotr Szulc
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The data includes genotypes of 482,906 markers for 1,000 individuals. They come from a simulation based on the Illumina 650K human array, which is typically used for SNP genotyping.

In theory, such data is easy to create, since it is just columns with values of 0, 1 and 2. What matters is the correlation structure, which has been preserved here and corresponds to the real one.

The data can be used to test methods for finding significant SNPs. You can generate a trait based on the significant variables of your choice, and then try to find them using the chosen technique (which is not easy, due to the huge number of variables).

The y.txt file contains the trait I simulated based on the following list of 24 SNPs:

- ch01_19810
- ch01_27796
- ch01_32763
- ch02_22034
- ch02_39189
- ch03_2703
- ch03_10846
- ch04_05127
- ch05_7371
- ch06_25838
- ch08_15190
- ch10_444
- ch10_8265
- ch11_12611
- ch11_20057
- ch12_3421
- ch14_6999
- ch15_3859
- ch16_4525
- ch17_4306
- ch18_1031
- ch19_1377
- ch19_6378
- ch22_33

See which ones you can find!
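
The workflow suggested above (pick some SNPs, simulate a trait from them, then try to recover them) can be sketched on a toy scale. Everything here is an assumption made for illustration: the genotypes are generated independently at random rather than loaded from this dataset, so they lack the correlation structure that makes the real task hard; the causal indices, effect size and sample sizes are arbitrary; and ranking by absolute Pearson correlation is just one simple marginal test, not the method used to create y.txt.

```python
import random


def simulate_genotypes(n_individuals, n_snps, rng):
    """Independent 0/1/2 genotypes (real data has correlated neighbours)."""
    return [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
            for _ in range(n_individuals)]


def simulate_trait(genotypes, causal, effect, rng):
    """Trait = effect * (sum of causal genotypes) + standard normal noise."""
    return [effect * sum(row[j] for j in causal) + rng.gauss(0.0, 1.0)
            for row in genotypes]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def rank_snps(genotypes, trait):
    """Return SNP column indices sorted by |correlation| with the trait."""
    scores = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        scores.append((abs(pearson(column, trait)), j))
    return [j for _, j in sorted(scores, reverse=True)]


rng = random.Random(0)
genotypes = simulate_genotypes(500, 200, rng)
causal = [3, 57, 120]  # hypothetical causal SNP columns
trait = simulate_trait(genotypes, causal, 1.0, rng)
top = rank_snps(genotypes, trait)[:len(causal)]
print("strongest marginal associations:", sorted(top))
```

With only 200 independent markers and a strong effect, the causal columns stand out clearly. On the real data, with 482,906 correlated markers, a marginal scan like this will also flag neighbours of the true SNPs, which is part of what the dataset is meant to test.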
