20 datasets found

d
Data from: Genomes To Fields (G2F) Inbred Ear Imaging Data 2017
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Genomes To Fields (G2F) Inbred Ear Imaging Data 2017 [Dataset]. https://catalog.data.gov/dataset/genomes-to-fields-g2f-inbred-ear-imaging-data-2017-079c0
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
A subset of ~30 inbreds were evaluated in 2014 and 2015 to develop an image based ear phenotyping tool. The data is stored in CyVerse. Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development. Resources in this dataset:Resource Title: CyVerse Genomes To Fields Inbred Ear Imaging 2017 dataset download. File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Edgar_Spalding_G2F_Inbred_Ear_Imaging_June_2017 Dataset (csv, tar.gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.
KMCP Manuscript Data
zenodo.org
data.niaid.nih.gov
+1more
Updated Dec 17, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wei Shen; Wei Shen (2022). KMCP Manuscript Data [Dataset]. http://doi.org/10.5281/zenodo.6334744
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6334744
Dataset updated
Dec 17, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Wei Shen; Wei Shen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
# KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping

## 1.code-and-documents

This directory contains the source code, executable binaries, and documents of KMCP,
which are also hosted at Github: https://github.com/shenwei356/kmcp .

Databases, usage, and tutorials of KMCP are also available at https://bioinf.shenwei.me/kmcp/.

- [Installation](https://bioinf.shenwei.me/kmcp/download)
- [Databases](https://bioinf.shenwei.me/kmcp/database)
- Tutorials
- [Taxonomic profiling](https://bioinf.shenwei.me/kmcp/tutorial/profiling)
- [Sequence and genome searching](https://bioinf.shenwei.me/kmcp/tutorial/searching)
- [Usage](https://bioinf.shenwei.me/kmcp/usage)
- [Benchmarks](https://bioinf.shenwei.me/kmcp/benchmark)
- [FAQs](https://bioinf.shenwei.me/kmcp/faq)

## 2.databases

This directory contains the building steps and reference genome accessions for
KMCP databases used in the manuscript.

cami2 Databases used in benchmarks on CAMI2 mouse gut datasets
kmcp Databases used in other benchmarks

## 3.figures

Each subdirectory contains steps to run the benchmark (`README.md`), steps for plotting (`README-plot.md`),
benchmark results, and figures.
f
Data from: Building a Statistical Model for Predicting Cancer Genes
datasetcatalog.nlm.nih.gov
Updated Nov 15, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amos, Christopher; Logothetis, Christopher J.; Gorlov, Ivan P.; Fang, Shenying; Gorlova, Olga Y. (2012). Building a Statistical Model for Predicting Cancer Genes [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001123791
Explore at:
Dataset updated
Nov 15, 2012
Authors
Amos, Christopher; Logothetis, Christopher J.; Gorlov, Ivan P.; Fang, Shenying; Gorlova, Olga Y.
Description
More than 400 cancer genes have been identified in the human genome. The list is not yet complete. Statistical models predicting cancer genes may help with identification of novel cancer gene candidates. We used known prostate cancer (PCa) genes (identified through KnowledgeNet) as a training set to build a binary logistic regression model identifying PCa genes. Internal and external validation of the model was conducted using a validation set (also from KnowledgeNet), permutations, and external data on genes with recurrent prostate tumor mutations. We evaluated a set of 33 gene characteristics as predictors. Sixteen of the original 33 predictors were significant in the model. We found that a typical PCa gene is a prostate-specific transcription factor, kinase, or phosphatase with high interindividual variance of the expression level in adjacent normal prostate tissue and differential expression between normal prostate tissue and primary tumor. PCa genes are likely to have an antiapoptotic effect and to play a role in cell proliferation, angiogenesis, and cell adhesion. Their proteins are likely to be ubiquitinated or sumoylated but not acetylated. A number of novel PCa candidates have been proposed. Functional annotations of novel candidates identified antiapoptosis, regulation of cell proliferation, positive regulation of kinase activity, positive regulation of transferase activity, angiogenesis, positive regulation of cell division, and cell adhesion as top functions. We provide the list of the top 200 predicted PCa genes, which can be used as candidates for experimental validation. The model may be modified to predict genes for other cancer sites.
u
Genomes To Fields 2016
agdatacommons.nal.usda.gov
catalog.data.gov
bin
Updated Dec 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Darwin Campbel; Natalia deLeon; Jode Edwards; Jack Gardiner; Naser Al Khalifah; Carolyn J. Lawrence-Dill; Jane Petzoldt; Cinta Romay; Renee Walton; Genomes to Fields Cooperators (2023). Genomes To Fields 2016 [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Genomes_To_Fields_2016/24852822
Explore at:
binAvailable download formats
Dataset updated
Dec 18, 2023
Dataset provided by
CyVerse Data Commons
Authors
Darwin Campbel; Natalia deLeon; Jode Edwards; Jack Gardiner; Naser Al Khalifah; Carolyn J. Lawrence-Dill; Jane Petzoldt; Cinta Romay; Renee Walton; Genomes to Fields Cooperators
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
Phenotypic, genotypic, and environment data for the 2016 field season: The data is stored in CyVerse. Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental (soil, weather) data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development. Resources in this dataset:Resource Title: CyVerse Genomes To Fields 2016 dataset download. File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/GenomesToFields_G2F_2016_Data_Mar_2018 Dataset (csv) and metadata (BibTex, Endnote) data downloads. See _readme.txt for file contents.
Data from: EukProt: a database of genome-scale predicted proteins across the...
figshare.com
datasetcatalog.nlm.nih.gov
+1more
bin
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas (2023). EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotes [Dataset]. http://doi.org/10.6084/m9.figshare.12417881.v3
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12417881.v3
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Version 3 (22 November, 2021)

See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): A selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above).

Scroll to the end of this page for changes since version 2.

Are we missing anything? Please let us know!

EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

This release contains 5 files:

EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).

EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.

EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. With the following columns:

EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.

Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.

Strain: the strain(s) of the species sequenced.

Previous_Names: any previous names that this species was known by.

Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).

Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).

Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).

Merged_Strains: whether multiple strains of the same species were merged to create the data set.

Data_Source_URL: the URL(s) from which the data were downloaded.

Data_Source_Name: the name of the data set (as assigned by the data source).

Paper_DOI: the DOI(s) of the paper(s) that published the data set.

Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details): ‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/ ‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/ ‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/ ‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/ ‘gffread’: v.0.12.3 https://github.com/gpertea/gffread ‘predict genes’: EukMetaSanity https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021) All parameter values were default, unless otherwise specified.

Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).

Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).

Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.

Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.

18S_Sequence_GenBank_ID: GenBank identifier for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that all available sequences are not full-length (< 1,500 bp).

18S_Sequence: 18S for the strain derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.

18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.

18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.

18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.

18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.

Changes since version 2

There are 324 new data sets included. 57 of these replace data sets from version 2.

40 newly published data sets were added to the list that are not included in the database (annotated in the Notes column with the reasons they were not included).

Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).

All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.

In the UniEuk_Taxonomy field, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).

Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.

The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.

In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.

EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.

Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:

[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]

Type abbreviations are P (protein) and T (transcriptome).

All characters not in the following list are removed from nucleic acid sequences: ACGTNUKSYMWRBDHV All characters not in the the following list are removed from protein sequences: ABCDEFGHIKLMNPQRSTUVWYZX*

Lists of legal characters are from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

microbetag : building a thorough database of genome-scale KO annotations

zenodo.org

application/gzip, bin +1

Updated Jan 21, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Haris Zafeiropoulos; Haris Zafeiropoulos (2024). microbetag : building a thorough database of genome-scale KO annotations [Dataset]. http://doi.org/10.5281/zenodo.10537295

Explore at:

zip, bin, application/gzipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.10537295

Dataset updated

Jan 21, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Haris Zafeiropoulos; Haris Zafeiropoulos

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

In this repository we keep internal data for the microbetag microbial co-occurrence network annotator.

microbetag makes use of 2-column files for each genome, indicating the KO term found and a KEGG module in which this terms takes part into.
As a single KO term might participates in more than one KEGG modules, the same KO might be more than once in an annotation file.

gtdb_modelseed_gems.zip md5:e3e62b305e64b27da7b80655d7f92f2c	for all the GTDB genomes their corresponding PATRIC annotations were gathered. Then, using modelseedpy we constructed their genome scale metabolic reconstructions
gtdb_kofam_scan_per_module.tar.gz md5:cbcc9aa1a28a5bd5f6661f832d27bcbf	all representative genomes of GTDB (v.202) were parsed and their corresponding `.faa` files were retrieved from the NCBI FTP. Then the kofam_scan tool was used to annotate them and finally a manual script was used to keep KOs of each genome per module.
updated_seedsets_of_interest.pckl md5:2a49534b52169daa4f96157b15d1b01c	A pickle file with the seeds of each GEM included in the gtdb_modelseed_gems.zip file and related to the KEGG MODULES based on the seedId_keggId_module.tsv file you can find on microbetag's GitHub page. Example: PATRIC SeedSet 373.172 [cpd00891, cpd00136, cpd00199, cpd01772, cpd00... 397278.5 [cpd00891, cpd00136, cpd01772, cpd02698, cpd08...
updated_non_seedsets_of_interest.pckl md5:75c435a69b183ea23aa58d43c1e051ba	A pickle file with the non seeds of each GEM included in the gtdb_modelseed_gems.zip file and related to the KEGG MODULES based on the seedId_keggId_module.tsv file you can find on microbetag's GitHub page. Example: PATRIC NonSeedSet 64187.548 [cpd00508, cpd00869, cpd00774, cpd03830, cpd00... 74426.1719 [cpd00204, cpd00447, cpd20171, cpd03470, cpd00...
phen_classes.zip md5:9e3f7a84fe7409ef0282ca5424797976	A list of pickle files with the re-trained classes of phenDB for the prediction of functional traits on a genome.

d
Data from: Plant Expression Database
catalog.data.gov
datasetcatalog.nlm.nih.gov
+2more
Updated Apr 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Plant Expression Database [Dataset]. https://catalog.data.gov/dataset/plant-expression-database-8ddd3
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
[NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype to phenotype, hypothesis building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them to previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies, displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions or checking any gene’s suitability as a steady-state control. Resources in this dataset:Resource Title: Website Pointer for Plant Expression Database, Iowa State University. File Name: Web Page, url: https://www.bcb.iastate.edu/plant-expression-database [NOTE: PLEXdb is no longer available online. Oct 2019.] Project description for the Plant Expression Database (PLEXdb) and integrated tools.
c
Genomes To Fields 2014 v.1 - deprecated
datacommons.cyverse.org
Updated 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carolyn Lawrence-Dill (2016). Genomes To Fields 2014 v.1 - deprecated [Dataset]. http://doi.org/10.7946/P25P4P
Explore at:
Unique identifier
https://doi.org/10.7946/P25P4P
Dataset updated
2016
Dataset provided by
CyVerse Data Commons
Authors
Carolyn Lawrence-Dill
License
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
d
Genomes To Fields 2014
datasets.ai
agdatacommons.nal.usda.gov
+1more
21
Updated Mar 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Agriculture (2024). Genomes To Fields 2014 [Dataset]. https://datasets.ai/datasets/genomes-to-fields-2014-d3326
Explore at:
21Available download formats
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Department of Agriculture
Description
Phenotypic, genotypic, and environment data for the 2014 field season: The data is stored in CyVerse.

Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.

Resources in this dataset:

Resource Title: CyVerse Genomes To Fields 2014 dataset download.

File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Carolyn_Lawrence_Dill_G2F_Nov_2016_V.3
Dataset (csv, h5, gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.
c
The Cancer Genome Atlas Prostate Adenocarcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Feb 2, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Cancer Imaging Archive (2014). The Cancer Genome Atlas Prostate Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.YXOGLM4Y
Explore at:
dicom, n/aAvailable download formats
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.YXOGLM4Y
Dataset updated
Feb 2, 2014
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institutehttp://www.cancer.gov/
Description
The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
A framework for the estimation of the proportion of true discoveries in...
plos.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nik Tuzov (2023). A framework for the estimation of the proportion of true discoveries in single nucleotide variant detection studies for human data [Dataset]. http://doi.org/10.1371/journal.pone.0196058
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0196058
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Nik Tuzov
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Any single nucleotide variant detection study could benefit from a fast and cheap method of measuring the quality of variant call list. It is advantageous to be able to see how the call list quality is affected by different variant filtering thresholds and other adjustments to the study parameters. Here we look into a possibility of estimating the proportion of true positives in a single nucleotide variant call list for human data. Using whole-exome and whole-genome gold standard data sets for training, we focus on building a generic model that only relies on information available from any variant caller. We assess and compare the performance of different candidate models based on their practical accuracy. We find that the generic model delivers decent accuracy most of the time. Further, we conclude that its performance could be improved substantially by leveraging the variant quality metrics that are specific to each variant calling tool.
Genome indexes for Mus musculus (mm39)
zenodo.org
data-staging.niaid.nih.gov
bin
Updated Apr 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yazgeldi Gamze; Yazgeldi Gamze; Katayama Shintaro; Katayama Shintaro (2023). Genome indexes for Mus musculus (mm39) [Dataset]. http://doi.org/10.5281/zenodo.7457660
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7457660
Dataset updated
Apr 28, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Yazgeldi Gamze; Yazgeldi Gamze; Katayama Shintaro; Katayama Shintaro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BUILDING HISAT2 INDEXES IN CSC
Here is the case for house mouse genome (mm39). The genome indexing step requires big memory and it might not be possible to carry out it on a laptop. Genome indexes for Mus musculus (mm39) were created using HISAT2 v2.2.1 on CSC (IT Center for Science), thanks to CSC-Puhti.

1. Create conda environment folder file to install the required packages, install and add the bin directory to the path.
mkdir STRTN-env
conda-containerize new --prefix STRTN-env STRTN-env.yml
export PATH="

2. Load the required module.
module load tykky
export PATH="

4. Extract splice sites and exons from a GTF file. Here we used wgEncodeGencodeBasicVM30 as the annotation file. You may additionally perform `hisat2_extract_snps_haplotypes_UCSC.py` to extract SNPs and haplotypes from a dbSNP file for human and mouse.
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/database/wgEncodeGencodeBasicVM30.txt.gz
unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_splice_sites.py - | grep -v ^chrUn > splice_sites.txt
unpigz -c wgEncodeGencodeBasicVM30.txt.gz | hisat2_extract_exons.py - | grep -v ^chrUn > exons.txt

5. Build the HISAT2 index. This outputs a set of files with suffixes. Here, `mouse_reference.1.ht2`, `mouse_reference.2.ht2`, ..., `mouse_reference.8.ht2` are generated.
In this case, `mouse_reference` is the basename used for `-i, --index`.
hisat2-build mouse_reference.fasta --ss splice_sites.txt --exon exons.txt mouse_index/mouse_reference

6. Create the sequence dictionary for the reference and Spike-in sequences. This is required for the Picard MergeBamAlignment program. Note that the original FASTA file (`mouse_reference.fasta` here) is also required.
picard CreateSequenceDictionary R=mouse_reference.fasta O=mouse_reference.dict

7. Put the genome indexes, genome fasta file, sequence dictionary to same folder.
mv mouse_reference.dict mouse_reference
mv mouse_reference.fasta mouse_reference
s
Kyoto Encyclopedia of Genes and Genomes Expression Database
scicrunch.org
rrid.site
Updated Jun 10, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Kyoto Encyclopedia of Genes and Genomes Expression Database [Dataset]. http://identifiers.org/RRID:SCR_001120
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_001120
Dataset updated
Jun 10, 2018
Description
Database for mapping gene expression profiles to pathways and genomes. Repository of microarray gene expression profile data for Synechocystis PCC6803 (syn), Bacillus subtilis (bsu), Escherichia coli W3110 (ecj), Anabaena PCC7120 (ana), and other species contributed by the Japanese research community.
c
Genomes To Fields 2014 v.2 - deprecated
datacommons.cyverse.org
Updated 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carolyn Lawrence-Dill (2016). Genomes To Fields 2014 v.2 - deprecated [Dataset]. http://doi.org/10.7946/P2201Q
Explore at:
Unique identifier
https://doi.org/10.7946/P2201Q
Dataset updated
2016
Dataset provided by
CyVerse Data Commons
Authors
Carolyn Lawrence-Dill
License
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
c
Genomes To Fields Inbred Ear Imaging 2017
datacommons.cyverse.org
Updated 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Edgar Spalding (2017). Genomes To Fields Inbred Ear Imaging 2017 [Dataset]. http://doi.org/10.7946/P2C34P
Explore at:
Unique identifier
https://doi.org/10.7946/P2C34P
Dataset updated
2017
Dataset provided by
CyVerse Data Commons
Authors
Edgar Spalding
License
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
c
The Cancer Genome Atlas Lung Adenocarcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Jan 30, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Cancer Imaging Archive (2017). The Cancer Genome Atlas Lung Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.JGNIHEP5
Explore at:
n/a, dicomAvailable download formats
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.JGNIHEP5
Dataset updated
Jan 30, 2017
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institutehttp://www.cancer.gov/
Description
The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Lung Phenotype Research Group.
Genomes To Fields Inbred Ear Imaging 2017 - Dataset - CyVerse Data Commons
ckan.cyverse.rocks
Updated Jun 23, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ckan.cyverse.rocks (2024). Genomes To Fields Inbred Ear Imaging 2017 - Dataset - CyVerse Data Commons [Dataset]. https://ckan.cyverse.rocks/dataset/genomes-to-fields-inbred-ear-imaging-2017
Explore at:
Dataset updated
Jun 23, 2024
Dataset provided by
CKANhttps://ckan.org/
License
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
f
Data from: MOESM3 of Designing intracellular metabolism for production of...
springernature.figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Dec 14, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tomokazu Shirai; Takashi Osanai; Akihiko Kondo (2016). MOESM3 of Designing intracellular metabolism for production of target compounds by introducing a heterologous metabolic reaction based on a Synechosystis sp. 6803 genome-scale model [Dataset]. http://doi.org/10.6084/m9.figshare.c.3602546_D2.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3602546_D2.v1
Dataset updated
Dec 14, 2016
Dataset provided by
figshare
Authors
Tomokazu Shirai; Takashi Osanai; Akihiko Kondo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file : Table S3. List of metabolites ignored when building the SyHyMeP.
c
The Cancer Genome Atlas Rectum Adenocarcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Jan 5, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Cancer Imaging Archive (2016). The Cancer Genome Atlas Rectum Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.F7PPNPNU
Explore at:
dicom, n/aAvailable download formats
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.F7PPNPNU
Dataset updated
Jan 5, 2016
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institutehttp://www.cancer.gov/
Description
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
c
The Cancer Genome Atlas Ovarian Cancer Collection
cancerimagingarchive.net
dicom, n/a
Updated May 29, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Cancer Imaging Archive (2020). The Cancer Genome Atlas Ovarian Cancer Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.NDO1MDFQ
Explore at:
n/a, dicomAvailable download formats
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.NDO1MDFQ
Dataset updated
May 29, 2020
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institutehttp://www.cancer.gov/
Description
The Cancer Genome Atlas Ovarian Cancer (TCGA-OV) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Ovarian Phenotype Research Group.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Agricultural Research Service (2025). Genomes To Fields (G2F) Inbred Ear Imaging Data 2017 [Dataset]. https://catalog.data.gov/dataset/genomes-to-fields-g2f-inbred-ear-imaging-data-2017-079c0

Data from: Genomes To Fields (G2F) Inbred Ear Imaging Data 2017

Explore at:

Dataset updated

Apr 21, 2025

Dataset provided by

Agricultural Research Service

Description

A subset of ~30 inbreds were evaluated in 2014 and 2015 to develop an image based ear phenotyping tool. The data is stored in CyVerse. Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development. Resources in this dataset:Resource Title: CyVerse Genomes To Fields Inbred Ear Imaging 2017 dataset download. File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Edgar_Spalding_G2F_Inbred_Ear_Imaging_June_2017 Dataset (csv, tar.gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.

Clear search

Close search

Google apps

Main menu

Data from: Genomes To Fields (G2F) Inbred Ear Imaging Data 2017

KMCP Manuscript Data

Data from: Building a Statistical Model for Predicting Cancer Genes

Genomes To Fields 2016

Data from: EukProt: a database of genome-scale predicted proteins across the...

microbetag : building a thorough database of genome-scale KO annotations

Data from: Plant Expression Database

Genomes To Fields 2014 v.1 - deprecated

Genomes To Fields 2014

The Cancer Genome Atlas Prostate Adenocarcinoma Collection

CIP TCGA Radiology Initiative

A framework for the estimation of the proportion of true discoveries in...

Genome indexes for Mus musculus (mm39)

Kyoto Encyclopedia of Genes and Genomes Expression Database

Genomes To Fields 2014 v.2 - deprecated

Genomes To Fields Inbred Ear Imaging 2017

The Cancer Genome Atlas Lung Adenocarcinoma Collection

CIP TCGA Radiology Initiative

Genomes To Fields Inbred Ear Imaging 2017 - Dataset - CyVerse Data Commons

Data from: MOESM3 of Designing intracellular metabolism for production of...

The Cancer Genome Atlas Rectum Adenocarcinoma Collection

CIP TCGA Radiology Initiative

The Cancer Genome Atlas Ovarian Cancer Collection

CIP TCGA Radiology Initiative

Data from: Genomes To Fields (G2F) Inbred Ear Imaging Data 2017