2 datasets found

GEO gene expression dataset recompute for selected tumor samples
data.niaid.nih.gov
Updated May 13, 2024
Visentin, Luca (2024). GEO gene expression dataset recompute for selected tumor samples [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10817923
Dataset updated
May 13, 2024
Dataset authored and provided by
Visentin, Luca
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We aligned and quantified RNA-Seq data present in GEO with a standardized pipeline to homogenize data preprocessing for downstream applications.

All uploaded files are UTF-8, .csv-formatted matrices. The *_expected_count.csv.gz files are unlogged, raw expression counts as reported by rsem-quantify-expression (see details below). The associated *_metadata.csv.gz files contain metadata pertinent to each column of the corresponding expression matrix. Some metadata files may have more rows than the associated expression matrix has sample columns. This happens for series that were only partially RNA-Seq based (e.g. combined RNA-Seq and miRNA-Seq samples under the same GEO accession ID).

Metadata columns are derived from GEO series files, and follow their definitions. See each GEO entry directly to determine metadata meaning.

Each recompute has at least the gene_id column holding Ensembl Gene IDs. The remaining columns are ENA run accession IDs of the specific recomputed samples. Each associated metadata file has at least the following columns:

geo_accession: The GEO sample ID of the sample.

ena_sample: The ENA sample ID of the sample.

ena_run: The ENA run accession ID of the sample, to be cross-referenced with the expression matrices.

The remaining columns are derived from GEO metadata files and other ENA-provided data. Please refer to the x.FASTQ package for more information.
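The cross-referencing described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not part of the dataset or of x.FASTQ: the helper names (`read_csv_gz`, `runs_to_geo`, `relabel_counts`) are hypothetical, and the accession IDs and count values in the demo are made-up placeholders. It assumes the `ena_run` and `geo_accession` column names listed above.

```python
import csv
import gzip
import io


def read_csv_gz(source):
    """Read a gzipped UTF-8 CSV (path or binary file object) into a list of dict rows."""
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def runs_to_geo(metadata_rows):
    """Map each ENA run accession to the GEO sample ID of the same sample."""
    return {row["ena_run"]: row["geo_accession"] for row in metadata_rows}


def relabel_counts(count_rows, run_to_geo):
    """Rename run-accession columns to GEO sample IDs and parse counts as floats."""
    relabelled = []
    for row in count_rows:
        new_row = {"gene_id": row["gene_id"]}
        for column, value in row.items():
            if column != "gene_id":
                # Expected counts are parsed as floats because they may be fractional.
                new_row[run_to_geo[column]] = float(value)
        relabelled.append(new_row)
    return relabelled


def _gz(text):
    """Demo helper: wrap CSV text as an in-memory gzipped file."""
    return io.BytesIO(gzip.compress(text.encode("utf-8")))


# Placeholder data standing in for a *_expected_count.csv.gz / *_metadata.csv.gz pair.
counts = read_csv_gz(_gz(
    "gene_id,SRR0000001,SRR0000002\n"
    "ENSG00000000003,10.0,3.5\n"
))
meta = read_csv_gz(_gz(
    "geo_accession,ena_sample,ena_run\n"
    "GSM0000001,SAMN00000001,SRR0000001\n"
    "GSM0000002,SAMN00000002,SRR0000002\n"
))
relabelled = relabel_counts(counts, runs_to_geo(meta))
print(relabelled)
```

With real files, `read_csv_gz` can be given a file path instead of an in-memory object.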

Pipeline Details

Alignment and quantification were performed with the x.FASTQ tool, available on GitHub, at commit 3a93dd77a70df59c74f7b15216c26f12cd918e81. The tool was installed locally on an Arch Linux machine running the Linux 6.7.8-zen1-1-zen kernel, with an 11th Gen Intel i7-1185G7 (8) CPU and an Intel TigerLake-LP GT2 [Iris Xe Graphics] GPU. Please note that no samples were filtered or omitted based on sample quality or sequencing depth. However, sensible trimming (e.g. of low-quality bases and common adapters) was performed on all samples.

The reference genome was downloaded from Ensembl, version hg38. STAR was used to build the genome index, with the overhang set to 149.
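For orientation, the index-building step might look roughly like the sketch below. It is an assumption, not the command actually run by x.FASTQ: it supposes that "overhang" refers to STAR's `--sjdbOverhang` option (which the STAR manual recommends setting to the read length minus one), that an annotation GTF was supplied, and that the paths and thread count shown are hypothetical placeholders. The function only assembles the command line and does not execute STAR.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, overhang=149, threads=4):
    """Assemble (but do not run) a STAR genome-index command line."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,          # output directory for the index
        "--genomeFastaFiles", fasta,        # reference genome FASTA
        "--sjdbGTFfile", gtf,               # annotation (assumed, see lead-in)
        "--sjdbOverhang", str(overhang),    # 149, as stated in the description
        "--runThreadN", str(threads),       # hypothetical thread count
    ]


# Placeholder paths, for illustration only.
command = star_index_command("star_index", "genome.fa", "annotation.gtf")
print(shlex.join(command))
```

To actually build an index, the list could be passed to `subprocess.run(command, check=True)` on a machine where STAR is installed.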
GEO gene expression dataset recompute for selected tumor samples
zenodo.org
application/gzip
Updated Mar 15, 2024
Luca Visentin (2024). GEO gene expression dataset recompute for selected tumor samples [Dataset]. http://doi.org/10.5281/zenodo.10817924
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.10817924
Dataset updated
Mar 15, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Luca Visentin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

We aligned and quantified RNA-Seq data present in GEO with a standardized pipeline to homogenize data preprocessing for downstream applications.

All uploaded files are UTF-8, `.csv`-formatted matrices. The `*_expected_count.csv.gz` files are unlogged, raw expression counts as reported by `rsem-quantify-expression` (see details below). The associated `*_metadata.csv.gz` files contain metadata pertinent to each column of the corresponding expression matrix.
Some metadata files may have more rows than the associated expression matrix has sample columns. This happens for series that were only partially RNA-Seq based (e.g. combined RNA-Seq and miRNA-Seq samples under the same GEO accession ID).

Metadata columns are derived from GEO series files, and follow their definitions. See each GEO entry directly to determine metadata meaning.

Each recompute has at least the `gene_id` column holding Ensembl Gene IDs. The remaining columns are ENA run accession IDs of the specific recomputed samples.
Each associated metadata file has at least the following columns:
- `geo_accession`: The GEO sample ID of the sample.
- `sample_accession`: The ENA sample ID of the sample.
- `run_accession`: The ENA run accession ID of the sample, to be cross-referenced with the expression matrices.
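
Because a metadata file can list samples that have no column in the expression matrix, it may help to separate the two groups before analysis. The following is a minimal sketch using only the Python standard library. The function name `split_metadata` is hypothetical, the accession IDs are made-up placeholders, and plain (uncompressed) CSV text is used for brevity.

```python
import csv
import io


def split_metadata(matrix_header, metadata_rows):
    """Split metadata rows by whether their run accession is a column of the matrix."""
    run_columns = set(matrix_header) - {"gene_id"}
    matched = [r for r in metadata_rows if r["run_accession"] in run_columns]
    unmatched = [r for r in metadata_rows if r["run_accession"] not in run_columns]
    return matched, unmatched


# Placeholder header of an expression matrix with two recomputed samples.
matrix_header = ["gene_id", "SRR0000001", "SRR0000002"]

# Placeholder metadata with a third sample that was not recomputed.
metadata_text = (
    "geo_accession,sample_accession,run_accession\n"
    "GSM0000001,SAMN00000001,SRR0000001\n"
    "GSM0000002,SAMN00000002,SRR0000002\n"
    "GSM0000003,SAMN00000003,SRR0000003\n"
)
metadata_rows = list(csv.DictReader(io.StringIO(metadata_text)))

matched, unmatched = split_metadata(matrix_header, metadata_rows)
print([r["geo_accession"] for r in matched])
print([r["geo_accession"] for r in unmatched])
```

Rows in `unmatched` would correspond to samples in the series that are absent from the matrix, such as the non-RNA-Seq samples mentioned above.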

## Pipeline Details

Alignment and quantification were performed with the `x.FASTQ` tool, available [on GitHub](https://github.com/TCP-Lab/x.FASTQ), installed locally on an Arch Linux machine running the Linux `6.7.8-zen1-1-zen` kernel with an `11th Gen Intel i7-1185G7 (8)` CPU and an `Intel TigerLake-LP GT2 [Iris Xe Graphics]` GPU.