Spodoptera frugiperda is a noctuid moth that devastates various crops, including corn, rice and cotton, and is found across most of the American continent. The purpose of this study was to integrate gene expression data from S. frugiperda guts and their associated metatranscriptomes under natural and controlled conditions. To this end, four S. frugiperda samples from the province of Tucumán (Argentina; subtropical region) were analysed. Specimens were obtained from different environments, altitudes and food sources, namely: 1) a transgenic maize (Zea mays) field at 495 m.a.s.l. where insecticides and fertilisers were applied (named MM; 26°49’50”S; 65°16’59.4”W); 2) Sorghum halepense at 495 m.a.s.l. (MS; 26°49’50”S; 65°16’59.4”W); 3) a maize field at 2283 m.a.s.l. where no insecticides or fertilisers were used (TV; 26°55’40.75”S; 65°45’19.90”W); and 4) a colony established from larvae originally collected from the same transgenic maize field as Sf_MM, reared for 9 generations under controlled conditions on an artificial diet adapted from [8], without the addition of antibiotics (BT). For all samples, total RNA extracted from the guts of fifth instar larvae (two digestive tracts per sample) was subjected to a modified one-step, sequence-independent reverse transcription and polymerase chain reaction amplification procedure, as described previously. High-throughput pyrosequencing of the samples was performed using a Roche GS FLX (Macrogen Inc., Korea), yielding ~1 Gb of metatranscriptomic reads with lengths of 50 to 1600 nucleotides (nt) (652 nt average). Raw sequence reads were trimmed with a custom application to remove nucleotides derived from the amplification primers.
Below is an outline of the main steps followed to create the uploaded databases:
I. Sequences were compared locally to a combined nucleotide database (nt16SLep = “Non-redundant” nucleotide sequence (nt) database + 16S rRNA gene (16S) database + Lepidopteran whole genome shotgun (Lep) projects completed at the time of the analysis) using BLASTN (Altschul et al., 1990) with a 1e-50 cutoff E-value, and to the protein database (nr = non-redundant protein sequence) using Diamond (Buchfink et al., 2014) with a 1e-17 cutoff E-value.
II. The homology search results were then processed as follows:
Step A: The output files from both homology searches were processed with MEGAN, a program that performs taxonomic binning and assigns sequences to taxa using the Lowest Common Ancestor (LCA) assignment algorithm (Huson et al., 2007). The taxonomic and functional assignments made by MEGAN for each sequence were then exported using MEGAN's export function.
Note: MEGAN computes a “species profile” by finding the lowest node in the NCBI taxonomy that encompasses the set of hit taxa and assigns the sequence to the taxon represented by that lowest node. With this approach, every sequence is assigned to some taxon; if the sequence aligns very specifically to a single taxon only, it is assigned to that taxon; the less specifically a sequence hits taxa, the higher up in the taxonomy it is placed.
Step B: The output files from both homology searches were also processed with a custom bash script. This script parses the homology search output files and generates two files (one for each homology search) containing the name of each sequence, its best hit (or no hit) and the corresponding E-value.
III. Create local database (Step C): All this information (from the exported MEGAN files and from the bash script output files) was then used to create a local SQLite database that includes all the available information for each sequence (from both homology searches).
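Steps B and C can be illustrated with a minimal Python sketch. It is not the authors' bash script: it assumes the homology search output is in 12-column tabular format (query in column 1, subject in column 2, E-value in column 11), which the description does not state, and the table name `reads`, its column names and the demo records are hypothetical. Only the two E-value cutoffs (1e-50 for BLASTN, 1e-17 for Diamond) come from the outline above.

```python
import csv
import io
import sqlite3


def best_hits(tabular_text, evalue_cutoff):
    """Return {query_id: (subject_id, evalue)}, keeping the lowest E-value hit per query.

    Assumes 12-column tabular output (an assumption, not stated in the source).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > evalue_cutoff:
            continue  # hit does not pass the cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def build_database(conn, read_ids, blastn_hits, diamond_hits):
    """Store one row per read with its best nucleotide and protein hit (or 'no hit')."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reads ("
        "read_id TEXT PRIMARY KEY, "
        "nt_best_hit TEXT, nt_evalue REAL, "
        "nr_best_hit TEXT, nr_evalue REAL)"
    )
    rows = []
    for read_id in read_ids:
        nt_hit, nt_e = blastn_hits.get(read_id, ("no hit", None))
        nr_hit, nr_e = diamond_hits.get(read_id, ("no hit", None))
        rows.append((read_id, nt_hit, nt_e, nr_hit, nr_e))
    conn.executemany("INSERT OR REPLACE INTO reads VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


# Hypothetical demo records, not real search results.
BLASTN_DEMO = "\n".join([
    "read1\tgi|A\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-80\t500",
    "read1\tgi|B\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-120\t600",
    "read2\tgi|C\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t90",
])
DIAMOND_DEMO = "read2\tprot|X\t70.0\t60\t18\t0\t1\t180\t1\t60\t1e-20\t95"

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_database(
        connection,
        ["read1", "read2"],
        best_hits(BLASTN_DEMO, 1e-50),   # BLASTN cutoff from step I
        best_hits(DIAMOND_DEMO, 1e-17),  # Diamond cutoff from step I
    )
    for record in connection.execute("SELECT * FROM reads ORDER BY read_id"):
        print(record)
```

Keeping one row per read, with "no hit" stored explicitly, mirrors the outline's requirement that every sequence appears in the database whether or not a search returned a match. The MEGAN exports from Step A would be joined on the same read identifier.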
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the EST-SSRs that were identified in the transcriptome.
CDT-DB provides transcriptomic information (spatiotemporal gene expression profile data) on the postnatal cerebellar development of mice (C57BL/6J & ICR). It is a tool for mining cerebellar genes and gene expression, and provides a portal to relevant bioinformatics links. The mouse cerebellar circuit develops through a series of cellular and morphological events, including neuronal proliferation and migration, axonogenesis, dendritogenesis, and synaptogenesis, all within three weeks after birth. Each event is controlled by a specific gene group whose expression profile must be encoded in the genome. To elucidate the genetic basis of cerebellar circuit development, CDT-DB analyzes spatiotemporal gene expression by using in situ hybridization (ISH) for cellular resolution and by using fluorescence differential display and microarrays (GeneChip) for developmental time series resolution. The CDT-DB not only provides a cross-search function for large amounts of experimental data (ISH brain images, GeneChip graphs, RT-PCR gel images), but also includes a portal function: all registered genes are hyperlinked to relevant bioinformatics websites covering gene ontology, genomes, proteins, pathways, cell functions, and publications. Thus, the CDT-DB is a useful tool for mining potentially important genes based on characteristic expression profiles in particular cell types or during a particular time window in developing mouse brains.
A specialized database for human alternative splicing (AS) based on H-Invitational full-length cDNAs. H-DBAS offers unique data and a viewer for human alternative splicing (AS) analysis. It contains:
* Genome-wide representative alternative splicing variants (RASVs) identified from the following datasets:
  * H-Inv full-length cDNAs (resource summary): H-Invitational cDNA dataset
  * H-Inv all transcripts (resource summary): Published human mRNA dataset
  * Mouse full-length cDNAs (resource summary): Mouse cDNA dataset
* RASVs affecting protein functions such as protein motif, GO, subcellular localization signal and transmembrane domain
* Conserved RASVs compared with mouse genome and the full-length cDNAs (H-Inv full-length cDNAs only)
TSA is an archive of computationally assembled transcript sequences from primary data such as ESTs and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from GenBank records because there are no physical counterparts to the assemblies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of functional annotation of the assembled unigenes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Table of contents
This repository contains the data that support the findings of the manuscript
"TOA: a software package for automated functional annotation in non-model plant species".
Directories:
benchmark-transcriptomes: Fasta files with the sequences corresponding to the benchmark transcriptomes tested in the manuscript:
Cnuc: Cocos nucifera embryos (Huang et al., 2014).
Fsyl: Fagus sylvatica leaves (Müller, Seifert, Lübbe, Leuschner, & Finkeldey, 2017).
Pcan: Pinus canariensis immature xylem (Chano, Collada, & Soto, 2017).
TOA: output of the six simulation tests performed with TOA.
EnTAP: output of the six simulation tests performed with EnTAP.
Trinotate: output of the three simulation tests performed with Trinotate.
References:
Chano, V., Collada, C., & Soto, A. (2017). Transcriptomic analysis of wound xylem formation in Pinus canariensis. BMC Plant Biology, 17(234). doi:10.1186/s12870-017-1183-3
Huang, Y. Y., Lee, C. P., Fu, J. L., Chang, B. C. H., Matzke, A. J. M., & Matzke, M. (2014). De novo transcriptome sequence assembly from coconut leaves and seeds with a focus on factors involved in RNA-directed DNA methylation. G3: Genes, Genomes, Genetics, 4(11), 2147–2157. doi:10.1534/g3.114.013409
Müller, M., Seifert, S., Lübbe, T., Leuschner, C., & Finkeldey, R. (2017). De novo transcriptome assembly and analysis of differential gene expression in response to drought in European beech. PLoS ONE, 12(9), 1–20. doi:10.1371/journal.pone.0184167
A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function
Understanding the cell-cell interactions that control CNS development and function has long been limited by the lack of methods to cleanly separate astrocytes, neurons, and oligodendrocytes. Here we describe the first method for the isolation and purification of developing and mature astrocytes from mouse forebrain. This method takes advantage of the expression of S100β by astrocytes. We used fluorescence-activated cell sorting (FACS) to isolate EGFP positive cells from transgenic mice that express EGFP under the control of an S100β promoter. By depletion of astrocytes and oligodendrocytes we obtained purified populations of neurons, while by panning with oligodendrocyte-specific antibodies we obtained purified populations of oligodendrocytes. Using GeneChip Arrays we then created a transcriptome database of the expression levels of over 20,000 genes by gene profiling these three main CNS neural cell types at postnatal ages day 1 to 30. This database provides the first global characterization of the genes expressed by mammalian astrocytes in vivo and is the first direct comparison between the astrocyte, neuron, and oligodendrocyte transcriptomes. We demonstrate that Aldh1L1, a highly expressed astrocyte gene, is a highly specific antigenic marker for astrocytes with a substantially broader, and therefore potentially more useful, pattern of astrocyte expression than the traditional astrocyte marker GFAP. This transcriptome database of acutely isolated and highly pure populations of astrocytes, neurons and oligodendrocytes is a resource for the neuroscience community, providing improved cell type specific markers and supporting a better understanding of neural development, function, and disease.
We acutely purified mouse astrocytes from early postnatal ages (P1) to later postnatal ages (P30), when astrocyte differentiation is morphologically complete (Bushong et al., 2004), and acutely purified mouse OL-lineage cells from stages ranging from OPCs to newly differentiated OLs to myelinating OLs. We extracted RNA from each of these highly purified, acutely isolated cell types and used GeneChip Arrays to determine the expression levels of over 20,000 genes and construct a comprehensive database of cell type specific gene expression in the mouse forebrain. Analysis of this database confirms cell type specific expression of many well characterized and functionally important genes. In addition, we have identified thousands of new cell type enriched genes, thereby providing important new information about astrocyte, OL, and neuron interactions, metabolism, development, and function. This database provides a comparison of the genome-wide transcriptional profiles of the main CNS cell types and is a resource to the neuroscience community for better understanding the development, physiology, and pathology of the CNS.
Keywords: Developmental CNS cell type comparison
FACS purification of astrocytes: Dissociated forebrains from S100β-EGFP mice were resuspended in panning buffer (DPBS containing 0.02% BSA and 12.5 U/ml DNase) and sequentially incubated on the following panning plates: secondary antibody only plate to deplete microglia, O4 plate to deplete OLs, PDGFRα plate to deplete OPCs, and a second O4 plate to deplete any remaining OLs. This procedure was sufficient to deplete all OL-lineage cells from animals P8 and younger; however, in older animals that had begun to myelinate, additional depletion of OLs and myelin debris was accomplished as follows. The nonadherent cells from the last O4 dish were harvested by centrifugation, and the cells were resuspended in panning buffer containing GalC, MOG, and O1 supernatant and incubated for 15 minutes at room temperature. The cell suspension was washed and then resuspended in panning buffer containing 20 μg donkey anti-mouse APC for 15 minutes. The cells were washed and resuspended in panning buffer containing propidium iodide (PI). EGFP+ astrocytes were then purified by fluorescence-activated cell sorting (FACS). Dead cells were gated out using high PI staining and forward light scatter. Astrocytes were identified based on high EGFP fluorescence and negative APC fluorescence from indirect immunostaining for OL markers GalC, MOG, and O1. Cells were sorted twice and routinely yielded >99.5% purity based on reanalysis of double sorted cells.
FACS purification of neurons: EGFP- cells were the remaining forebrain cells after microglia, OLs, and astrocytes had been removed, and were primarily composed of neurons and, to a lesser extent, endothelial cells (we estimate < 4% endothelial cells at P7 and < 20% endothelial cells at P17). EGFP- cells from S100β-EGFP dissociated forebrain were FACS purified in parallel with astrocyte purification and were sorted based on their negative EGFP fluorescence. Cells were sorted twice and routinely yielded >99.9% purity. In independent preparations, the EGFP- cell population was additionally depleted of endothelial cells and pericytes by sequentially labeling with biotin-BSL1 lectin and streptavidin-APC while also labeling for OL markers as described above. Cells were sorted twice and routinely yielded >99.9% purity.
Panning purification of oligodendrocyte lineage cells: Dissociated mouse forebrains were resuspended in panning buffer. In order to deplete microglia, the single-cell suspension was sequentially panned on four BSL1 panning plates. The cell suspension was then sequentially incubated on two PDGFRα plates (to purify and deplete OPCs), one A2B5 plate (to deplete any remaining OPCs), two MOG plates (to purify and deplete myelinating OLs), and one GalC plate (to purify the remaining PDGFRα-negative, MOG-negative OLs).
The adherent cells on the first PDGFRα, MOG, and GalC plates were washed to remove all antigen-negative nonadherent cells. The cells were then lysed while still attached to the panning plate with Qiagen RLT lysis buffer, and total RNA was purified. Purified OPCs were >95% NG2 positive and 0% MOG positive. Purified Myelin OLs were 100% MOG positive, >95% MBP positive, and 0% NG2 positive. Purified GalC OLs depleted of OPCs and Myelin OLs were <10% MOG positive and ~50% weakly NG2 positive, a reflection of their recent development as early OLs.
Data normalization and analysis: Raw image files were processed using Affymetrix GCOS and the MAS 5.0 algorithm. Intensity data were normalized per chip to a target intensity (TGT) value of 500, and expression data and absent/present calls for individual probe sets were determined. Gene expression values were normalized and modeled across arrays using the dChip software package with invariant-set normalization and a PM model (www.dchip.org; Li and Wong, 2001). The 29 samples were grouped into 9 sample types: Astros P7-P8, Astros P17, Astros P17-gray matter (P17g), Neurons P7, Neurons P17, Neurons-endothelial cell depleted (P7n, P17n), OPCs, GalC-OLs, and MOG-OLs. Gene filtering was performed to select probe sets that were consistently expressed in at least one cell type, where consistently expressed was defined as being called present and having a MAS 5.0 intensity level greater than 200 in at least two-thirds of the samples in the cell type. We identified 20,932 of the 45,037 probe sets that were consistently expressed in at least one of the nine cell types. The Significance Analysis of Microarrays (SAM) method (Tusher et al., 2001) was used to determine genes that were significantly differentially expressed between different cell types (see Supplemental Table S2 for SAM cell type groupings). Clustering was performed using the hclust method with complete linkage in R. Expression values were transformed for clustering by computing a mean expression value for the gene using those samples in the corresponding SAM statistical analysis, and then subtracting the mean from expression intensities. In order to preserve the log2 scale of the data, unless otherwise indicated, no normalization by variance was performed. Plots were created using the gplots package in R. The Bioconductor software package (Gentleman et al., 2004) was used throughout the expression analyses. Functional analyses were performed through the use of Ingenuity Pathways Analysis (Ingenuity® Systems, www.ingenuity.com).
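The gene filtering rule and the mean-centering transform described above can be sketched in a few lines of Python. This is an illustration only, not the authors' dChip/R workflow: the function names, the "P" code for a present call and the demo values are assumptions, and the centering function uses every value it is given rather than a SAM-defined sample subset. The thresholds (intensity greater than 200, at least two-thirds of the samples in a cell type) are taken from the text.

```python
from statistics import mean


def consistently_expressed(intensities, calls, min_intensity=200.0):
    """Check one probe set within one cell type.

    A sample passes when it is called present ("P", an assumed encoding)
    and its MAS 5.0 intensity exceeds min_intensity. The probe set is
    consistently expressed when at least two-thirds of samples pass.
    """
    passing = sum(
        1
        for value, call in zip(intensities, calls)
        if call == "P" and value > min_intensity
    )
    # Integer form of passing / n >= 2 / 3, which avoids float rounding.
    return 3 * passing >= 2 * len(intensities)


def expressed_in_any_cell_type(samples_by_cell_type):
    """Keep a probe set if it is consistently expressed in at least one cell type."""
    return any(
        consistently_expressed(intensities, calls)
        for intensities, calls in samples_by_cell_type.values()
    )


def mean_center(log2_values):
    """Subtract the gene's mean from each value; no scaling by variance."""
    centre = mean(log2_values)
    return [value - centre for value in log2_values]


if __name__ == "__main__":
    # Hypothetical probe set measured in two cell types with three samples each.
    probe_set = {
        "Astros P17": ([350.0, 420.0, 180.0], ["P", "P", "P"]),
        "OPCs": ([90.0, 120.0, 250.0], ["A", "A", "P"]),
    }
    print(expressed_in_any_cell_type(probe_set))
    print(mean_center([8.0, 10.0, 12.0]))
```

Because only the mean is subtracted, differences between centered values remain log2 fold changes, which is the reason the text gives for skipping variance normalization.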
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Alternative splicing can lead to distinct protein isoforms. These can have different functions in specific cells and tissues or in different developmental stages. In this study, we explored whether transcripts assembled from long read, nanopore-based, direct RNA-sequencing (RNA-seq) could improve the identification of protein isoforms in human K562 cells. By comparing with Illumina-based short read RNA-seq, we showed that a large proportion of Ensembl transcripts (5949/14,326) and genes expressing alternatively spliced transcripts (486/2981) identified with long direct reads were missed by short paired-end reads. By co-analyzing proteomic and transcriptomic data, we also showed that some peptides (826/35,976), proteins (262/3215), and protein isoforms arising from distinct transcript variants (574/1212) identified with isoform-specific peptides via custom long-read-based databases were missed in Illumina-derived databases. Finally, we generated unequivocal peptide evidence for a set of protein isoforms and showed that long read, direct RNA-seq allows the discovery of novel protein isoforms not already in reference databases or custom databases built from short read RNA-seq data. Our analysis highlights the benefits of long read RNA-seq data in the generation of reference databases to increase tandem mass spectrometry (MS/MS) identification of protein isoforms.
Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third d...
Purpose: To better understand the function of the various cell types of the brain, we prospectively purified neurons, astrocytes, oligodendrocyte precursor cells, newly formed oligodendrocytes, myelinating oligodendrocytes, microglia, endothelial cells, and pericytes from mouse cerebral cortex. We generated a transcriptome database for these 8 cell types by RNA sequencing and used a highly sensitive algorithm to detect alternative splicing events in each gene. Our analysis identified thousands of new cell type enriched genes and splicing isoforms that will provide novel markers for cell identification, new tools for genetic manipulation, and numerous insights into the biology of the brain.
Method Part 1: To purify astrocytes, we took advantage of a BAC transgenic mouse line expressing EGFP under the control of regulatory sequences in Aldh1l1-BAC. This line has been previously characterized to have complete astrocyte-specific labeling throughout the brain. Cells from a litter of 8-16 P7 Aldh1l1-EGFP transgenic mice of both genders were pooled together as one biological replicate. The cortices were dissected out and meninges were removed. The tissue was enzymatically dissociated to make a suspension of single cells as described previously. Briefly, the tissue was incubated at 33 °C for 45 minutes in 20 ml of a papain solution containing Earle’s balanced salts (EBSS, Sigma, St. Louis, MO, E7510), D(+)-glucose (22.5mM), NaHCO3 (26mM), DNase (125U/ml, Worthington, Lakewood, NJ, LS002007), papain (9 U/ml, Worthington, Lakewood, NJ, LS03126), and L-cysteine (1mM, Sigma, St. Louis, MO, C7880). The papain solution was equilibrated with 5% CO2 and 95% O2 gas before and during papain treatment. Following papain treatment, the tissue was washed three times with 4.5ml of inhibitor buffer containing BSA (1.0mg/ml, Sigma, St. Louis, MO, A-8806) and ovomucoid (also known as trypsin inhibitor, 1.0 mg/ml, Roche Diagnostics Corporation, Indianapolis, IN 109878) and then mechanically dissociated by gentle sequential trituration using a 5ml pipette. Dissociated cells were layered on top of 10ml of high concentration inhibitor solution with 5mg/ml BSA and 5mg/ml ovomucoid and centrifuged at 130g for 5 minutes. The cell pellet was then resuspended in 12 ml Dulbecco’s phosphate-buffered saline (DPBS, Invitrogen, Carlsbad, CA 14287) containing 0.02% BSA and 12.5U/ml DNase and filtered through a 20μm Nitex mesh (Sefar America Inc., Depew NY, Lab Pak 03-20/14) to remove undissociated cell clumps. This yielded a single cell suspension. Cell health was assessed by trypan blue exclusion. Only single cell suspensions with >85% viability were used for purification experiments. 1μg/ml propidium iodide (PI, Sigma, St. Louis, MO, P4864) was added to the single cell solution to label dead cells. Cells were sorted on a BD Aria II cell sorter (BD Bioscience) with a 70μm nozzle. Dead cells and debris were gated first by their low forward light scatter and high side light scatter and secondly by high PI staining. Doublets were removed by high side light scatter. Cell concentration and flow rate were carefully adjusted to maximize purity. Astrocytes were identified based on high EGFP fluorescence. FACS routinely yielded >99% purity based on reanalysis of sorted cells.
Method Part 2: To purify neurons, a single cell suspension was prepared as described above and incubated at 34 °C for one hour to allow expression of cell surface protein antigens digested by papain, and then incubated on two sequential panning plates coated with BSL-1 to deplete endothelial cells (10 minutes each), followed by a 30 minute incubation on a plate coated with mouse IgM anti-O4 hybridoma (Bansal et al., 1989; 4ml hybridoma supernatant diluted with 8ml DPBS/0.2% BSA) to deplete OPCs, and then incubated for 20 minutes on a plate coated with rat anti-mouse CD45 (BD Pharmingen 550539, 1.25μg in 12ml of DPBS/0.2% BSA) to deplete microglia and macrophages. Finally, cells were added to a plate coated with rat anti-mouse L1CAM (30μg in 12ml of DPBS/0.2% BSA, Millipore, Billerica, MA, MAB5272) to bind neurons. The adherent cells on the L1CAM plate were washed 8 times with 10-20 ml of DPBS to remove all antigen-negative nonadherent cells, and then removed from the plate by treating with trypsin (Sigma, 1,000U/ml, T-4665) in 8ml Ca2+ and Mg2+ free EBSS (Irvine Scientific, Santa Ana, CA, 9208) for 3-10 minutes at 37°C in a 10% CO2 incubator. The trypsin was then neutralized with 20ml of fetal calf serum (FCS) solution containing 30% FCS (Gibco, 10437-028), 35% Dulbecco’s modified Eagle medium (DMEM, Invitrogen, 11960-044), and 35% Neurobasal (Gibco, 21103-049). The cells were dislodged by gentle squirting of FCS solution over the plate and harvested by centrifugation at 200g for 10 minutes.
Method Part 3: To purify microglia and oligodendrocyte-lineage cells, the mice were first perfused with 10ml PBS to remove macrophage contamination from the brain. A single cell suspension was then pr...
THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 16, 2013. Database and customized tools to study the PFAM protein domain content of the transcriptome for all expressed genes of Homo sapiens, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans tethered to both a genomics array repository database and a range of external information resources. GeneSpeed has merged information from several existing data sets including the Gene Ontology Consortium, InterPro, Pfam, Unigene, as well as micro-array datasets. GeneSpeed is a database of PFAM domain homology contained within Unigene. Because Unigene is a non-redundant dbEST database, this provides a wide encompassing overview of the domain content of the expressed transcriptome. We have structured the GeneSpeed Database to include a rich toolset allowing the investigator to study all domain homology, no matter how remote. As a result, homology cutoff score decisions are determined by the scientist, not by a computer algorithm. This quality is one of the novel defining features of the GeneSpeed database giving the user complete control of database content. In addition to a domain content toolset, GeneSpeed provides an assortment of links to external databases, a unique and manually curated Transcription Factor Classification list, as well as links to our newly evolving GeneSpeed BetaCell Database. GeneSpeed BetaCell is a micro-array depository combined with custom array analysis tools created with an emphasis around the meta analysis of developmental time series micro-array datasets and their significance in pancreatic beta cells.
Database of transcriptional start sites (TSSs) representing exact positions in the genome, based on a unique experimentally validated TSS sequencing method, TSS Seq. A major part of human adult and embryonic tissues is covered. DBTSS contains 491 million TSS tag sequences collected from a total of 20 tissues and 7 cell cultures. Also integrated are RNA-seq data generated from subcellular-fractionated RNAs and ChIP Seq data of histone modifications, RNA polymerase II and several transcriptional regulatory factors in cultured cell lines. External epigenomic data, such as the chromatin map of the ENCODE project, are also included. The TSS information is associated with public and original SNV data in order to identify single nucleotide variations (SNVs) in the regulatory regions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each row represents one study.
To study the impact of wheat streak mosaic virus on global gene expression in wheat curl mite, we generated a de novo transcriptome assembly using 50 x 50 paired end reads from the Illumina HiSeq 2500. Reads were assembled using Trinity (version 2.0.6) and contigs greater than 200 nt were retained. All assembled transcripts were annotated with the Trinotate pipeline using blastp searches against the Swiss-prot/Uni-Prot database, blastx searches against the Swiss-prot/Uni-Prot databases, HMM searches against the Pfam-A database, blastp searches against the non-redundant protein database, and signalP and tmHMM predictions. To reduce noise from low abundance transcripts not well supported by the data, we filtered the assembly to retain only those transcripts with TPM values >=0.5.
Resources in this dataset:
Resource Title: Raw Trinity Assembly.
File Name: Trinity.fasta.txt
Resource Description: Raw trinity assembly obtained from wheat curl mite using 50 x 50 Illumina paired end reads from the HiSeq2500.
Resource Software Recommended: Notepad++, url: https://notepad-plus-plus.org/; Text wrangler, url: https://itunes.apple.com/us/app/textwrangler/id404010395?mt=12
Resource Title: Trinotate annotations for raw Trinity assembly.
File Name: trinotate_annotations_report.xls
Resource Description: Trinotate results for raw wheat curl mite transcriptome assembly.
Resource Software Recommended: Excel, url: https://products.office.com/en-us/excel; Libre Office Calc, url: https://www.libreoffice.org/discover/calc/
Resource Title: Blastp results versus non-redundant protein database.
File Name: wheat_curl_mite_blastp_nr.txt
Resource Description: Blastp results for protein coding unigenes from raw Trinity transcriptome assembly (wheat curl mite). Output format is default.
Resource Software Recommended: Notepad++, url: https://notepad-plus-plus.org/; Text wrangler, url: https://itunes.apple.com/us/app/textwrangler/id404010395?mt=12
Resource Title: Protein predictions for raw trinity transcriptome assembly (wheat curl mite).
File Name: transcriptome.all.cds.pep.fasta.txt
Resource Description: Putative coding regions were predicted using Transdecoder. Default parameters were used in conjunction with Pfam-A searches to identify putative open reading frames (ORFs).
Resource Title: Protein predictions for final transcriptome assembly (wheat curl mite).
File Name: transcriptome.all.cds.pep.fasta.txt
Resource Description: Protein coding regions were predicted using Transdecoder. ORFs were identified using default parameters in conjunction with Pfam-A searches.
Resource Software Recommended: Notepad++, url: https://notepad-plus-plus.org/; Text wrangler, url: https://itunes.apple.com/us/app/textwrangler/id404010395?mt=12
Resource Title: Final trinity transcriptome assembly for wheat curl mite.
File Name: Trinity.mite.fasta.txt
Resource Description: Transcripts less than 200 nt and transcripts with TPM values less than 0.5 were removed from the assembly. In addition, transcripts whose coding sequences had highest scoring blastp matches to microbes were also removed from the assembly.
Resource Title: Nucleotide coding regions for final transcriptome assembly for wheat curl mite.
File Name: transcriptome.mite.cds.fasta.txt
Resource Description: Nucleotide sequences corresponding to coding regions from the final transcriptome assembly for wheat curl mite. Open reading frames (ORFs) were predicted using Transdecoder. Default parameters with the addition of the identification of Pfam-A domains were used for ORF identification.
Resource Title: Trinotate annotations for final Trinity assembly (wheat curl mite).
File Name: trinotate.mite.xls
Resource Description: Trinotate results for final wheat curl mite transcriptome assembly. Blastp and blastx searches against Swiss-Prot/Uni-Prot were performed along with Pfam-A searches using HMMER. Signal peptides and transmembrane domains were also identified.
Resource Software Recommended: Excel, url: https://products.office.com/en-us/excel; Libre Office Calc, url: https://www.libreoffice.org/discover/calc/
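The length and abundance filter behind the final assembly can be sketched as follows. This is a hedged illustration, not the pipeline actually used: the function names, the demo records and the TPM lookup table are hypothetical, and the length boundary (keep sequences of 200 nt or more) follows the final assembly description, which removes transcripts shorter than 200 nt. The additional removal of transcripts with best blastp matches to microbes is not covered here.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Identifier is the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_transcripts(records, tpm_by_id, min_length=200, min_tpm=0.5):
    """Keep transcripts with length >= min_length and TPM >= min_tpm.

    Transcripts missing from tpm_by_id are treated as TPM 0 and dropped
    (an assumption made for this sketch).
    """
    return [
        (name, seq)
        for name, seq in records
        if len(seq) >= min_length and tpm_by_id.get(name, 0.0) >= min_tpm
    ]


# Hypothetical demo data, not taken from the wheat curl mite assembly.
DEMO_FASTA = [
    ">t1 len=250", "A" * 250,   # long enough, well expressed
    ">t2", "C" * 150,           # too short
    ">t3", "G" * 300,           # long enough, but low abundance
]
DEMO_TPM = {"t1": 1.2, "t2": 5.0, "t3": 0.1}

if __name__ == "__main__":
    for name, seq in filter_transcripts(read_fasta(DEMO_FASTA), DEMO_TPM):
        print(name, len(seq))
```

In practice the TPM table would come from whichever abundance estimation tool was run on the assembly; the dataset description does not name one, so the dictionary here stands in for that output.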
A publicly available database of transposed elements (TEs) located within protein-coding genes of 7 organisms: human, mouse, chicken, zebrafish, fruit fly, nematode and sea squirt. Using TranspoGene, the user can learn about many aspects of the effect these TEs have on their host genes, such as: exonization events (including alternative splicing-related data), insertion of TEs into introns, exons, and promoters, specific location of the TE over the gene, evolutionary divergence of the TE from its consensus sequence and involvement in diseases. The TranspoGene database is quickly searchable through its website, supports many kinds of searches and is available for download. TranspoGene contains information regarding the specific type and family of the TEs, genomic and mRNA location, sequence, supporting transcript accession and alignment to the TE consensus sequence. The database also contains host gene-specific data: gene name, genomic location, Swiss-Prot and RefSeq accessions, diseases associated with the gene and splicing pattern. The TranspoGene and microTranspoGene databases can be used by researchers interested in the effect of TE insertion on the eukaryotic transcriptome.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These databases consolidate a variety of datasets related to the model organism Ruegeria pomeroyi DSS-3. The data were primarily generated by members of the Moran Lab at the University of Georgia, and put together in this format using anvi'o v7.1-dev through the collaborative efforts of Zac Cooper, Sam Miller, and Iva Veseli (special thanks to Christa Smith and Lidimarie Trujillo Rodriguez for their help with gene annotations). The data includes:
(R_POM_DSS3-contigs.db) the complete genome and megaplasmid sequence of R. pomeroyi, along with highly-curated gene annotations established by the Moran Lab and automatically-generated annotations from NCBI COGs, KEGG KOfam/BRITE, Pfams, and anvi'o single-copy core gene sets. It also contains annotations for the Moran Lab's TnSeq mutant library (https://doi.org/10.1101/2022.09.11.507510; https://doi.org/10.1038/s43705-023-00244-6).
(PROFILE-VER_01.db) read-mapping data from multiple transcriptome and metatranscriptome samples generated by the Moran lab to the R. pomeroyi genome. Some coverage data is stored in the AUXILIARY-DATA.db file. This data can be visualized using anvi-interactive. Publicly-available samples are labeled with their SRA accession number.
(DEFAULT-EVERYTHING.db) gene-level coverage data from the transcriptome and metatranscriptome samples stored in the profile database, as well as per-gene normalized spectral abundance counts from proteomes matched to a subset of the transcriptomes and gene mutant fitness data from https://doi.org/10.1073/pnas.2217200120. This data can also be visualized using anvi-interactive (see instructions below). The proteome data layers are labeled according to their matching transcriptome samples.
(R_pom_reproducible_workflow.md) a reproducible workflow describing how the databases were generated.
Please note that using these databases requires the development version of anvi'o, v8-dev, or a later version of anvi'o if available. They are not usable with anvi'o v8 or earlier.
Instructions for visualizing the genes database in the anvi'o interactive interface: Anvi'o expects genes databases to be located in a folder called GENES, so in order to use the specific database included in this datapack, you must move it to the expected location by running the following commands in your terminal:
mkdir GENES
mv DEFAULT-EVERYTHING.db GENES/
Once that is done, you can use the following command to visualize the gene-level information:
anvi-interactive -c R_POM_DSS3-contigs.db -p PROFILE-VER_01.db -C DEFAULT -b EVERYTHING --gene-mode
To view only the proteomic data and its matched transcriptomes, you can add the flag --state-autoload proteomes to the above command.
To view all transcriptomes and the proteomes organized by study of origin, you can add the flag --state-autoload figure to the above command.
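The staging and launch steps above can also be scripted. The following is a minimal Python sketch, not part of the datapack: the helper names `stage_genes_db` and `build_command` are hypothetical, and the only anvi'o flags used are the ones quoted in the instructions above. It assumes the three database files sit together in one working directory.

```python
import shutil
from pathlib import Path

# File names as given in the datapack description.
CONTIGS_DB = "R_POM_DSS3-contigs.db"
PROFILE_DB = "PROFILE-VER_01.db"
GENES_DB = "DEFAULT-EVERYTHING.db"


def stage_genes_db(workdir):
    """Move the genes database into workdir/GENES (same effect as mkdir + mv)."""
    workdir = Path(workdir)
    genes_dir = workdir / "GENES"
    genes_dir.mkdir(exist_ok=True)
    source = workdir / GENES_DB
    target = genes_dir / GENES_DB
    # Only move when the file is still in its original location,
    # so the helper can be called repeatedly without failing.
    if source.exists() and not target.exists():
        shutil.move(str(source), str(target))
    return target


def build_command(state=None):
    """Return the anvi-interactive argument list, optionally loading a saved state."""
    cmd = [
        "anvi-interactive",
        "-c", CONTIGS_DB,
        "-p", PROFILE_DB,
        "-C", "DEFAULT",
        "-b", "EVERYTHING",
        "--gene-mode",
    ]
    if state is not None:
        # "proteomes" and "figure" are the two states named in the text above.
        cmd += ["--state-autoload", state]
    return cmd
```

For example, after calling `stage_genes_db(".")`, the list returned by `build_command("proteomes")` could be passed to `subprocess.run` in an environment where a compatible anvi'o version is installed.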
U.S. Government Works https://www.usa.gov/government-works
Yellowhorn (Xanthoceras sorbifolium Bunge), a deciduous shrub or small tree native to north China, is of great economic value. Seeds of yellowhorn are rich in oil containing unsaturated long-chain fatty acids that have been used for producing edible oil and nervonic acid capsules. However, the lack of a high-quality genome sequence hampers the understanding of its evolution and gene functions.
In this study, the whole genome of yellowhorn was sequenced and assembled by integrating Illumina sequencing, PacBio single-molecule real-time sequencing, 10X Genomics linked reads, Bionano optical maps and Hi-C. The yellowhorn genome assembly was 439.97 Mb, comprising 15 pseudo-chromosomes that cover 95.42% (419.84 Mb) of the genome. Repetitive sequences accounted for 56.39% of the yellowhorn genome. The genome contained 21,059 protein-coding genes. Of these, 18,503 (87.46%) were functionally annotated with at least one term by searching against other databases. Transcriptomic analysis showed that 341, 113, 100, 135 and 125 genes were specifically expressed in leaf, hermaphrodite flower, shoot, staminate flower and young fruit, respectively. Phylogenetic analysis suggested that yellowhorn diverged from its common ancestor with Dimocarpus longan approximately 58.63 million years ago.
The availability and subsequent annotation of the yellowhorn genome, as well as the identification of tissue-specific functional genes, provide a valuable reference for plant comparative genomics, evolutionary studies and molecular design breeding.
MIT License https://opensource.org/licenses/MIT
Version 1.0.0 (12/28/2022)
This is a drug repurposing dataset under MIT licence, compiled by Dr. Clémence Réda
# drugs | # diseases | Sparsity number | # positive associations | # negative associations | # genes
------- | ---------- | --------------- | ----------------------- | ----------------------- | -------
871 | 144 | 0.76% | 773 | 181 | 10,811
All drugs (resp., diseases) are associated with a gene expression feature vector of length 10,811 (that is, all drugs and diseases in the feature matrices appear in the association matrix, and vice versa). However, some diseases/drugs are not necessarily involved in negative or positive associations (meaning that all pairs with those items have an association value of 0).
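The figures in the table can be cross-checked with a short calculation. The sketch below assumes that the "sparsity number" is the share of drug-disease pairs with a nonzero association; this definition is an assumption, although it does reproduce the reported 0.76%. The function name `sparsity` is illustrative.

```python
def sparsity(n_positive, n_negative, n_drugs, n_diseases):
    """Fraction of drug-disease pairs that carry a known (nonzero) association."""
    return (n_positive + n_negative) / (n_drugs * n_diseases)


# Values taken from the summary table above.
value = sparsity(n_positive=773, n_negative=181, n_drugs=871, n_diseases=144)
print(f"{value:.2%}")  # 954 known pairs out of 125,424 -> 0.76%
```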
----------
This dataset consists of three .CSV files:
* Drug-Disease Association Matrix
1. "ratings_mat.csv"
This matrix contains values in {-1,0,1} where -1 stands for a negative association (i.e., the drug failed for some reason to treat the considered disease: e.g., lack of accrual in the associated clinical trial, or proven toxicity), 1 for a positive association (i.e., the drug was shown to treat the disease), and 0 for unknown associated status. The columns are diseases, identified by their MedGen Concept ID, whereas rows are drugs, identified by their DrugBank IDs or PubChem CIDs.
* Drug Feature Matrix
1. "items.csv"
This matrix has drugs in its columns, identified by their DrugBank IDs or PubChem CIDs, and genes in its rows, identified by their HUGO Gene Symbol. Its entries are the genewise transcriptomic variation induced by drug treatment, taken from the CREEDS or the LINCS L1000 databases.
* Disease Feature Matrix
1. "users.csv"
This matrix has diseases in its columns, identified by their MedGen Concept IDs, and genes in its rows, identified by their HUGO Gene Symbol. Its entries are the genewise transcriptomic variation induced by the disease, taken from the CREEDS database.
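The relationship between the three files can be checked programmatically. The following is a minimal sketch using only the Python standard library. It assumes each CSV stores column identifiers in its first row and row identifiers in its first column; that layout is an assumption and should be confirmed against the actual files. The helper names `read_matrix` and `summarize` are hypothetical.

```python
import csv


def read_matrix(handle):
    """Read a CSV with column IDs in the first row and row IDs in the first column."""
    rows = list(csv.reader(handle))
    col_ids = rows[0][1:]
    row_ids = [r[0] for r in rows[1:]]
    values = [[float(x) for x in r[1:]] for r in rows[1:]]
    return row_ids, col_ids, values


def summarize(ratings, items, users):
    """Check ID alignment across the three matrices and count associations."""
    drug_ids, disease_ids, assoc = ratings       # rows: drugs, columns: diseases
    genes_in_items, item_drugs, _ = items        # rows: genes, columns: drugs
    genes_in_users, user_diseases, _ = users     # rows: genes, columns: diseases
    flat = [v for row in assoc for v in row]
    if not set(flat) <= {-1.0, 0.0, 1.0}:
        raise ValueError("association values must be in {-1, 0, 1}")
    return {
        "same_drugs": set(drug_ids) == set(item_drugs),
        "same_diseases": set(disease_ids) == set(user_diseases),
        "same_genes": genes_in_items == genes_in_users,
        "positive": flat.count(1.0),
        "negative": flat.count(-1.0),
    }
```

Each file would be opened with `open(path, newline="")` and passed to `read_matrix`. If the layout assumption holds, `summarize` should report 773 positive and 181 negative associations, matching the table above.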
----------
Further information about the generation of those matrices is available by running the Jupyter notebook TRANSCRIPT_dataset-v1.0.0.ipynb on the following GitHub repository: https://github.com/RECeSS-EU-Project/drug-repurposing-datasets. For any questions, please contact the author at
https://spdx.org/licenses/CC0-1.0.html
Background: Body plan development in multi-cellular organisms is largely determined by homeotic genes. Expression of homeotic genes, in turn, is partially regulated by insulator binding proteins (IBPs). While only a few enhancer-blocking IBPs have been identified in vertebrates, the common fruit fly Drosophila melanogaster harbors at least twelve different enhancer-blocking IBPs. We screened recently compiled insect transcriptomes from the 1KITE project and genomic and transcriptomic data from public databases, aiming to trace the origin of IBPs in insects and other arthropods.
Results: Our study shows that the last common ancestor of insects (Hexapoda) already possessed a substantial number of IBPs. Specifically, of the twelve known insect IBPs, at least three (i.e., CP190, Su(Hw), and CTCF) already existed prior to the evolution of insects. Furthermore, we found GAF orthologs in early branching insect orders, including Zygentoma (silverfish and firebrats) and Diplura (two-pronged bristletails). Mod(mdg4) is most likely a derived feature of Neoptera, while Pita is likely an evolutionary novelty of holometabolous insects. Zw5 appears to be restricted to schizophoran flies, whereas BEAF-32, ZIPIC and the Elba complex are probably unique to the genus Drosophila. Selection models indicate that insect IBPs evolved under neutral or purifying selection.
Conclusions: Our results suggest that a substantial number of IBPs either pre-date the evolution of insects or evolved early during insect evolution. This suggests an evolutionary history of insulator binding proteins in insects different from that previously thought. Moreover, our study demonstrates the versatility of the 1KITE transcriptomic data for comparative analyses in insects and other arthropods.