Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
100 microarray and RNA-seq gene expression datasets from five model species (human, mouse, fruit fly, arabidopsis plants, and baker's yeast). These datasets represent the benchmark set that was used to test our clust clustering method and to compare it with five widely used clustering methods (MCL, k-means, hierarchical clustering, WGCNA, and self-organising maps). This data resource includes raw data files, pre-processed data files, clustering results, clustering results evaluation, and scripts.
The files are split into three zipped parts, 100Datasets_part_1.zip, 100Datasets_part_2.zip, and 100Datasets_part_3.zip. The contents of the three zipped files should be extracted to a single folder (e.g. 100Datasets).
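The extraction step can be sketched in Python as below. This is a minimal sketch, not part of the resource: the function name `extract_parts` and the `source_dir` argument are hypothetical, and whether the archives already contain a top-level folder is not stated above.

```python
import zipfile
from pathlib import Path

# File names as listed in the description above.
PARTS = [
    "100Datasets_part_1.zip",
    "100Datasets_part_2.zip",
    "100Datasets_part_3.zip",
]


def extract_parts(source_dir, target_dir="100Datasets", parts=PARTS):
    """Extract every available part into one folder.

    Returns the names of the parts that were not found in source_dir.
    (Hypothetical helper; not shipped with the data resource.)
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in parts:
        archive = Path(source_dir) / name
        if not archive.is_file():
            missing.append(name)
            continue
        with zipfile.ZipFile(archive) as zf:
            # All parts are unpacked into the same target folder.
            zf.extractall(target)
    return missing
```

Calling `extract_parts("downloads")` would unpack whichever parts exist in a hypothetical `downloads/` folder and report the ones that are missing.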
Below is a thorough description of the files and folders in this data resource.
Scripts
The scripts used to apply each one of the clustering methods to each one of the 100 datasets and to evaluate their results are all included in the folder (scripts/).
Datasets and clustering results (folders starting with D)
The datasets are labelled as D001 to D100. Each dataset has two folders: D###/ and D###_Res/, where ### is the number of the dataset. The first folder only includes the raw dataset while the second folder includes the results of applying the clustering methods to that dataset. The files ending with _B.tsv include clustering results in the form of a partition matrix. The files ending with _E include metrics evaluating the clustering results. The files ending with _go and _go_E respectively include the enriched GO terms in the clustering results and evaluation metrics of these GO terms.
Simultaneous analysis of multiple datasets (folders starting with MD)
As our clust method is designed to extract clusters from multiple datasets simultaneously, we also tested it over multiple datasets. All folders starting with MD_ are related to "multiple datasets (MD)" results. Each MD experiment simultaneously analyses d randomly selected datasets either out of a set of 10 arabidopsis datasets or out of a set of 10 yeast datasets. For each one of the two species, all d values from 2 to 10 were tested, and at each one of these d values, 10 different runs were conducted, where at each run a subset of d datasets is selected randomly.
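The sampling design described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the dataset labels, the seed, and the function name `draw_md_runs` are hypothetical, and the actual random selections used for the MD folders are not reproduced by it.

```python
import random


def draw_md_runs(dataset_ids, d_values=range(2, 11), runs_per_d=10, seed=0):
    """For each d, draw runs_per_d random subsets of d datasets.

    dataset_ids: list of labels for the 10 datasets of one species
    (labels are hypothetical here). The seed is an assumption made
    so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    plan = {}
    for d in d_values:
        # random.sample draws d distinct datasets without replacement.
        plan[d] = [sorted(rng.sample(dataset_ids, d)) for _ in range(runs_per_d)]
    return plan
```

Note that for d = 10 every draw necessarily contains all 10 datasets, so the runs at that value can only differ in other respects.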
The folders MD_10A and MD_10Y include the full sets of 10 arabidopsis or 10 yeast datasets, respectively. Each folder with the format MD_10#_d#_Res## includes the results of applying the six clustering methods at one of the 10 random runs of one of the selected d values. For example, the "MD_10A_d4_Res03/" folder includes the clustering results of the 3rd random selection of 4 arabidopsis datasets (the letter A in the folder's name refers to arabidopsis).
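The naming scheme can be decoded programmatically. The sketch below assumes the pattern is exactly MD_10, a species letter (A or Y), _d, the d value, _Res, and the run number, as in the example above; the helper name `parse_md_folder` is hypothetical.

```python
import re

# Pattern inferred from the folder-name format described above.
MD_PATTERN = re.compile(r"^MD_10(?P<species>[AY])_d(?P<d>\d+)_Res(?P<run>\d+)$")
SPECIES = {"A": "arabidopsis", "Y": "yeast"}


def parse_md_folder(name):
    """Return species, d and run number for an MD results folder, else None."""
    match = MD_PATTERN.match(name.rstrip("/"))
    if match is None:
        return None
    return {
        "species": SPECIES[match["species"]],
        "d": int(match["d"]),
        "run": int(match["run"]),
    }
```

With this sketch, `parse_md_folder("MD_10A_d4_Res03/")` yields arabidopsis, d = 4, run = 3, while the full-set folders MD_10A and MD_10Y do not match and return None.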
Our clust method is applied directly over multiple datasets where each dataset is in a separate data file. Each "MD_10#_d#_Res##" folder includes these individual files in a sub-folder named "Processed_Data/". However, the other clustering methods only accept a single input data file. Therefore, the datasets are merged first before being submitted to these methods. Each "MD_10#_d#_Res##" folder includes a file "X_merged.tsv" for the merged data.
Evaluation metrics (folders starting with Metrics)
Each clustering results folder (D###_Res or MD_10#_d#_Res##) includes some clustering evaluation files ending with _E. This information is combined into tables for all datasets, and these tables appear in the folders starting with "Metrics_".
Other files and folders
The GO folder includes the reference GO term annotations for arabidopsis and yeast. The Datasets file includes a TAB-delimited table describing the 100 datasets. The SearchCriterion file includes the objective methodology of searching the NCBI database to select these 100 datasets. The Specials file includes some special considerations for a couple of datasets that differ a bit from what is described in the SearchCriterion file. The Norm### files and the files in the Reps/ folder describe normalisation codes and replicate structures for the datasets and were fed to the clust method as inputs. The Plots/ folder includes plots of the gene expression profiles of the individual genes in the clusters generated by each one of the 6 methods over each one of the 100 datasets. Only up to 14 clusters per method are plotted.
OBJECTIVE: To investigate the differentially expressed genes related to the chemosensitivity of laryngeal squamous cell carcinoma (LSCC) using microarrays. METHODS: 1. A total of 11 patients who underwent induction chemotherapy for primary hypopharyngeal squamous cell carcinoma (7 patients were sensitive to chemotherapy and the others were not) were recruited for microarray and miRNA array gene expression analysis. 2. Bioinformatics analysis of differentially expressed genes screened by microarrays: the differential gene cluster analysis was applied to biological processes, cellular components and molecular functions using the GO database; the differential gene enrichment analysis was applied to signaling pathways using the KEGG database, and the differentially expressed and biologically meaningful core genes would be screened. RESULTS: 1. Microarray analysis identified 1554 genes significantly related to the sensitivity to chemotherapy; among these 1554 genes, 777 showed higher expression in the tissue from patients who were sensitive to chemotherapy, while 785 presented the contrasting pattern. CONCLUSIONS: The research revealed a gene expression signature of chemosensitivity in laryngeal squamous cell carcinoma using microarrays. The result will contribute to the understanding of the molecular basis of laryngeal squamous cell carcinoma and help to improve diagnosis and treatment.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Motivation: DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data to make classification and diagnosis of disease. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in data analysis. The performance of cluster analysis of this high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with disease classes. Results: Here we proposed a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes to conduct three-class disease classification and prediction. We tested our method using Golub’s leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of feature selection to classify and predict three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested using four public datasets and compared with those of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and its performance was better than that of the other methods.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
We present a large-scale analysis of mRNA coexpression based on 60 large human data sets containing a total of 3924 microarrays. We sought pairs of genes that were reliably coexpressed (based on the correlation of their expression profiles) in multiple data sets, establishing a high-confidence network of 8805 genes connected by 220,649 “coexpression links” that are observed in at least three data sets. Confirmed positive correlations between genes were much more common than confirmed negative correlations. We show that confirmation of coexpression in multiple data sets is correlated with functional relatedness, and show how cluster analysis of the network can reveal functionally coherent groups of genes. Our findings demonstrate how the large body of accumulated microarray data can be exploited to increase the reliability of inferences about gene function.
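The idea of keeping only links confirmed in at least three data sets can be sketched as follows. This is a simplified illustration, not the authors' pipeline: how a pair is judged coexpressed within one data set is not specified above, so the sketch assumes each data set has already been reduced to a list of coexpressed gene pairs, and the function name `confirmed_links` is hypothetical.

```python
from collections import Counter


def confirmed_links(links_per_dataset, min_datasets=3):
    """Return gene pairs that appear in at least min_datasets data sets.

    links_per_dataset: one iterable of (gene_a, gene_b) pairs per data set.
    Pairs are treated as unordered, and a pair is counted at most once
    per data set.
    """
    counts = Counter()
    for links in links_per_dataset:
        # The set removes duplicates within one data set; frozenset
        # makes (A, B) and (B, A) the same link.
        counts.update({frozenset(pair) for pair in links})
    return {
        tuple(sorted(pair))
        for pair, n in counts.items()
        if n >= min_datasets and len(pair) == 2  # skip self-pairs
    }
```

Raising `min_datasets` trades network size for confidence, which mirrors the observation above that confirmation in multiple data sets correlates with functional relatedness.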
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Mutation cluster analysis is critical for understanding certain mutational mechanisms relevant to genetic disease, diversity, and evolution. Yet whole genome sequencing for detection of mutation clusters is prohibitively expensive for most organisms and population surveys. Single nucleotide polymorphism (SNP) genotyping arrays, like the Mouse Diversity Genotyping Array, offer a low-cost alternative, screening for mutations at hundreds of thousands of loci across the genome using experimental designs that permit capture of de novo mutations in any tissue. Formal statistical tools for genome-wide detection of mutation clusters under a microarray probe sampling system are yet to be established. A challenge in the development of statistical methods is that microarray detection of mutation clusters is constrained to select SNP loci captured by probes on the array. This paper develops a Monte Carlo framework for cluster testing and assesses test statistics for capturing potential deviations from spatial randomness which are motivated by, and incorporate, the array design. While null distributions of the test statistics are established under spatial randomness via the homogeneous Poisson process, power performance of the test statistics is evaluated under postulated types of Neyman-Scott clustering processes through Monte Carlo simulation. A new statistic is developed and recommended as a screening tool for mutation cluster detection. The statistic is demonstrated to be excellent in terms of its robustness and power performance, and useful for cluster analysis in settings of missing data. The test statistic can also be generalized to any one-dimensional system where every site is observed, such as DNA sequencing data. The paper illustrates how informal graphical tools for detecting clusters may be misleading. 
The statistic is used for finding clusters of putative SNP differences in a mixture of different mouse genetic backgrounds and clusters of de novo SNP differences arising between tissues with development and carcinogenesis.
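A Monte Carlo test of this general kind can be sketched in a few lines of Python. The sketch is an illustration only and does not implement the paper's statistic, which is not given above: it swaps in a deliberately simple statistic (the number of neighbouring probe pairs that are both mutated), treats probes by their rank along the genome rather than by physical position, and simulates the null by placing the same number of mutations uniformly at random among the probes. All names and defaults are hypothetical.

```python
import random


def adjacent_pairs(mutated):
    """Count pairs of mutated probes that are neighbours in probe order."""
    idx = sorted(set(mutated))
    return sum(1 for a, b in zip(idx, idx[1:]) if b - a == 1)


def monte_carlo_cluster_pvalue(n_probes, mutated, n_sim=999, seed=0):
    """One-sided Monte Carlo p-value for clustering of mutated probes.

    n_probes: number of probes on the array, indexed 0..n_probes-1 in
    genomic order. mutated: indices of probes showing a mutation.
    Null model (an assumption of this sketch): the same number of
    mutations placed uniformly at random among the probes.
    """
    rng = random.Random(seed)
    observed = adjacent_pairs(mutated)
    k = len(set(mutated))
    hits = 0
    for _ in range(n_sim):
        simulated = rng.sample(range(n_probes), k)
        if adjacent_pairs(simulated) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (1 + hits) / (1 + n_sim)
```

Restricting the simulation to probe indices is what ties the null distribution to the array design: mutations can only ever be observed at loci that carry a probe.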
Data analysis server / software designed to test statistical significance of gene microarray data, visualize the results, and provide links to clone information and gene index. Several public datasets are also available.
Bio Resource for array genes is a free online resource for easy access to collective and integrated information from various public biological resources for human, mouse, rat, fly and C. elegans genes. The resource includes information about the genes that are represented in Unigene clusters. This resource provides interactive tools to selectively view, analyze and interpret gene expression patterns against the background of gene and protein functional information. Different query options are provided to mine the biological relationships represented in the underlying database. The Search button leads to the list of available query tools. This Bio resource is a platform designed as an online resource to assist researchers in analyzing results of microarray experiments and developing a biological interpretation of the results. The site is mainly intended to help interpret the unique gene expression patterns found as biological changes that can lead to new diagnostic procedures and drug targets. This interactive site allows users to selectively view a variety of information about gene functions that is stored in an underlying database. Although there are other online resources that provide a comprehensive annotation and summary of genes, this resource differs from these by further enabling researchers to mine biological relationships amongst the genes captured in the database using new query tools, thus providing a unique way of interpreting microarray results based on the knowledge provided for the cellular roles of genes and proteins. A total of six different query tools are provided, and each offers different search features, analysis options and different forms of display and visualization of data. The data is collected in a relational database from public resources: Unigene, LocusLink, OMIM, NCBI dbEST, protein domains from NCBI CDD, Gene Ontology, Pathways (KEGG, GenMAPP and BioCarta) and BIND (Protein interactions). 
Data is dynamically collected and compiled twice a week from public databases. Search options offer the capability to organize and cluster genes based on their interactions in biological pathways, their association with Gene Ontology terms, tissue/organ-specific expression or any other user-chosen functional grouping of genes. A color coding scheme is used to highlight differential gene expression patterns against a background of gene functional information. Concept hierarchies (Anatomy and Diseases) of MeSH (Medical Subject Headings) terms are used to organize and display the data related to tissue-specific expression and diseases. Sponsors: The BioRag database is maintained by the Bioinformatics group at Arizona Cancer Center. The material presented here is compiled from different public databases. BioRag is hosted by the Biotechnology Computing Facility of the University of Arizona. 2002, 2003 University of Arizona.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
List of significantly enriched GO functions for the cluster shown in Fig. 7. (XLSX 17 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Microarray analysis results. Clusters generated through K-means clustering in Genesis for the microarray. (XLSX 330 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Clusters formed as part of a study to determine the function of kap108 in Saccharomyces cerevisiae. Genes clustered were those that underwent at least a 40% change in differential expression between wild-type and kap108∆ cells following the addition of oxidative stress. Three oxidative time points were considered (10 min oxidative stress, 60 min oxidative stress, and 60 min oxidative stress followed by 60 min no stress), and clustering was done based on how differential gene expression between mutant and wild-type cells changed between time points. A 20% increase was considered "Up", a 20% decrease was considered "Dn", and everything else was considered "Nc". With four time points considered, there are three transitions between consecutive time points, and having three possibilities for each transition meant 27 possible clusters. This file contains information on all 27 clusters, including the genes within them, graphical depictions of how differential expression changes for each gene over all four time points, a summary of YEASTRACT Gene Ontology analysis for each cluster, and links to full YEASTRACT Gene Ontology analyses.
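The count of 27 follows from three labels over three transitions (3 × 3 × 3), which the Python sketch below enumerates. It is an illustration under assumptions: the description does not say exactly how the 20% change is computed, so the sketch takes already-computed relative changes (for example 0.25 for a 25% increase) as input, and the names `label_change` and `pattern_for` are hypothetical.

```python
from itertools import product

LABELS = ("Up", "Dn", "Nc")

# Every combination of one label per transition: 3 ** 3 = 27 patterns.
ALL_PATTERNS = list(product(LABELS, repeat=3))


def label_change(relative_change, threshold=0.20):
    """Label one transition from its relative change (assumed precomputed)."""
    if relative_change >= threshold:
        return "Up"
    if relative_change <= -threshold:
        return "Dn"
    return "Nc"


def pattern_for(relative_changes):
    """Map the three transition changes of one gene to its cluster pattern."""
    return tuple(label_change(c) for c in relative_changes)
```

A gene whose differential expression rises by 25%, then falls by 50%, then changes by 10% would fall in the ("Up", "Dn", "Nc") cluster under these assumptions.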
With the goal of understanding the epigenetic regulation required for germ cell-specific gene expression, we devised a method of DNA methylation analysis adapted for a small number of developing germ cells. This microarray-based method provides a genome-wide assay of DNA methylation using a sub-nanogram quantity of genomic DNA. Using this technique, we obtained DNA methylation profiles for mouse germ cells in various developmental stages including primordial germ cells (PGC) and for stem cells derived from embryos or germ cells. Cluster analysis of the data revealed that each cell type possesses its own characteristic DNA methylation profile, enabling classification of the cell types. This classification is generally consistent with that based on gene expression profiles except for primordial germ cells, whose genome is globally hypomethylated. Among the differentially methylated sites thus identified, we focused on a group of genomic sequences hypomethylated specifically in germline cells. These hypomethylated sequences tend to be clustered, forming large (10 kb to ~9 Mb) genomic domains. Most of these hypomethylated regions, designated here as Large Hypomethylated Domains (LoD), correspond to segmentally duplicated regions that contain gene families showing germ cell-specific expression. These include mouse orthologues of human cancer testis antigen genes. Most LoDs appear to be enriched with H3 lysine 9 dimethylation (H3K9me2), usually regarded as a repressive histone modification. It thus appears that such a unique epigenomic state (i.e., DNA hypomethylation with H3K9me2 enrichment) may be a prerequisite for the expression of genes contained in these genomic domains. H3K9me2 of GS cells and cumulus
RNA from different tissues of Litopenaeus vannamei was used to test the ability of the microarray to discriminate tissue-specific gene expression profiles. Keywords: validation, tissue comparison. RNA from circulating hemocytes, gills, hepatopancreas, or muscle (3 biological replicates each) was hybridized to the L. vannamei microarray. The number of hybridizing genes was noted and the relationships among the global profiles in each sample were determined by cluster analysis.
Gene expression data portal developed for the stem cell community, containing public gene expression datasets derived from microarray, RNA sequencing and single cell profiling technologies. Portal to visualize and download curated stem cell data. Provides easy-to-use and intuitive tools for biologists to visually explore data, including interactive gene expression profiles, principal component analysis plots and hierarchical clusters, among others.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This file contains example R scripts used for the study presented here. The several sections in the file correspond to the different DE and clustering analyses performed. (TXT)
PAPER 1: "Identification of novel subgroups of high-risk pediatric precursor B acute lymphoblastic leukemia (B-ALL) by unsupervised microarray analysis: clinical correlates and therapeutic implications. A Children's Oncology Group (COG) study."
ABSTRACT: We examined gene expression profiles of pre-treatment specimens from 207 patients from the COG P9906 study to identify signatures of children with high risk B-precursor acute lymphoblastic leukemia (ALL) and to determine whether the resulting clusters are associated with either specific clinical features or treatment response characteristics. Four unsupervised clustering methods were utilized to classify patients into similar groups. The different clustering algorithms showed significant overlap in cluster membership. Two clusters contained all cases with either t(1;19)(q23;p13) translocations or MLL rearrangements. The other six clusters were novel and had no recurring chromosomal abnormalities or distinctive clinical features. Members of two of these novel clusters had significant survival differences when compared to the overall 4-year relapse-free survival (RFS) of 61%. These included clusters of patients with either significantly better (94.7%) or worse (21.0%) RFS at 4 years. Children of Hispanic/Latino ethnicity were disproportionately present in the poor outcome cluster. The poor outcome cluster represents a novel biologically distinctive subset of B-precursor ALL that may occur at least as frequently as BCR/ABL. Further molecular characterization of this cluster may lead to the discovery of genomic abnormalities that can be targeted to improve the currently dismal outcome for children with this gene signature.
The sample data have also been used in another study:
PAPER 2: "Gene expression classifiers for minimal residual disease and relapse free survival improve outcome prediction and risk classification in children with high risk acute lymphoblastic leukemia. A Children's Oncology Group study".
ABSTRACT: Background. Nearly 25% of children with B-precursor ALL present with "high-risk" disease (HR-ALL) that is resistant to current therapies. Gene expression profiling may yield molecular classifiers for outcome prediction that can be used to improve risk classification and therapeutic targeting. Methods. Expression profiles were obtained in pre-treatment leukemic samples from 207 uniformly treated children with HR-ALL. Relapse free survival (RFS) was 61% at 4 years and flow cytometric measures of minimal residual disease (MRD) at the end of induction (day 29) were predictive of outcome (P<0.001). Molecular classifiers predictive of RFS and MRD were developed using extensive cross-validation procedures. Results. A 38 gene molecular risk classifier predictive of RFS (MRC-RFS) distinguished two groups in HR-ALL with different relapse risks: low (4 yr RFS: 81%, n=109) vs. high (4 yr RFS: 50%, n=98) (P<0.0001). In multivariate analysis, the best predictor combined MRC-RFS and day 29 flow MRD data, classifying children into low (87% RFS), intermediate (62% RFS), or high risk (29% RFS) groups (P<0.0001). A 21 gene molecular classifier predictive of MRD could effectively substitute for day 29 flow MRD, yielding a combined classifier that similarly distinguished three risk groups at pre-treatment (low: 82% RFS; intermediate: 63% RFS; and high risk: 45% RFS) (P<0.0001). This combined molecular classifier was further validated on an independent cohort of 84 children with HR-ALL (P = 0.006). Conclusions. Molecular classifiers predictive of RFS and MRD can be used to distinguish distinct prognostic groups within HR-ALL, significantly improving risk classification schemes and the ability to prospectively identify children at diagnosis who will respond to or fail current treatment regimens.
NOTE: Due to Children's Oncology Group (COG) restrictions, outcome and MRD data cannot be provided as part of the covariate data for this dataset at the present time. If you would like to arrange individual access to this data, please contact COG or the PI of this study, Dr. Cheryl Willman, at the University of New Mexico Cancer Center (cwillman@unm.edu) to arrange a collaboration. Unsupervised clustering and supervised risk classification analyses of 207 diagnostic samples and associated clinical covariate data. See the Summary for greater details. The data were analyzed using Microarray Suite version 5.0 (MAS 5.0) in the Affymetrix Gene Chip Operating Software Version 1.4. Probe masking was used (see 9906_TT207_Affymetrix_probe_mask.msk, linked below as a supplementary file). Otherwise all Affymetrix default parameter settings were used. Global scaling as the normalization method, with the default target intensity of 500, was used.
Multiple myeloma (MM) is characterized by marked genomic instability. Beyond structural rearrangements, a relevant role in its biology is played by allelic imbalances leading to significant variations in ploidy status. To better elucidate the genomic complexity of MM, we analyzed a panel of 45 patients using combined FISH and microarray approaches. Using a self-developed procedure to infer exact local copy numbers for each sample, we identified a significant fraction of patients showing marked aneuploidy. A conventional clustering analysis showed that aneuploidy, chromosome 1 alterations, hyperdiploidy and recursive deletions at 1p and chromosomes 13, 14 and 22 were the main aberrations driving sample grouping. Then, we integrated mapping information with gene and microRNA expression profiles: a multiclass analysis of the identified clusters showed a marked gene-dosage effect, particularly concerning 1q transcripts, also confirmed by correlating gene expression levels and local copy number alterations. A wide dosage effect also affected microRNAs, indicating that structural abnormalities in MM are closely reflected in their expression imbalances. Finally, we identified several loci in which gene and microRNA expression correlated with loss-of-heterozygosity occurrence. Our results provide insights into the composite network linking genome structure and gene/microRNA transcriptional features in MM. Experiment Overall Design: This series of microarray experiments contains the gene expression profiles of purified plasma cells (PCs) obtained from 4 normal donors (N), 11 monoclonal gammopathy of undetermined significance (MGUS), 133 multiple myeloma (MM) and 9 plasma cell leukemia (PCL) patients at diagnosis. PCs were purified from bone marrow specimens, after red blood cell lysis with 0.86% ammonium chloride, using CD138 immunomagnetic microbeads. The purity of the positively selected PCs was assessed by morphology and flow cytometry and was > 90% in all cases. 
5 micrograms of total RNA was processed and, in accordance with the manufacturer's protocols, 15 micrograms of fragmented biotin-labelled cRNA were hybridized on GeneChip Human Genome U133A Arrays (Affymetrix Inc.). The arrays were scanned using the Agilent GeneChip Scanner G2500A. The images were acquired using Affymetrix MicroArray Suite (MAS) 5.0 software and the probe level data converted to expression values using the Bioconductor function for the Robust Multi-Array average (RMA) procedure (Irizarry et al, 2003), in which perfect match intensities are background adjusted, quantile-quantile normalised and log2 transformed.
With the population of older and overweight individuals on the rise in the Western world, there is an ever greater need to slow the aging processes and reduce the burden of age-associated chronic disease that would significantly improve the quality of human life and reduce economic costs. Caloric restriction (CR), is the most robust and reproducible intervention known to delay aging and to improve healthspan and lifespan across species (1); however, whether this intervention can extend lifespan in humans is still unknown. Here we report that rats and humans exhibit similar responses to long-term CR at both the physiological and molecular levels. CR induced broad phenotypic similarities in both species such as reduced body weight, reduced fat mass and increased the ratio of muscle to fat. Likewise, CR evoked similar species-independent responses in the transcriptional profiles of skeletal muscle. This common signature consisted of three key pathways typically associated with improved health and survival: IGF-1/insulin signaling, mitochondrial biogenesis and inflammation. To our knowledge, these are the first results to demonstrate that long-term CR induces a similar transcriptional profile in two very divergent species, suggesting that such similarities may also translate to lifespan-extending effects in humans as is known to occur in rodents. These findings provide insight into the shared molecular mechanisms elicited by CR and highlight promising pathways for therapeutic targets to combat age-related diseases and promote longevity in humans. Overall design: Male Fisher 344 rats (n=54) were randomly assigned to two groups at 2 months of age. One group was kept ad libitum (AL) fed throughout their lifespan while the calorie restriction (CR) group was progressively brought down to a 40% CR. All animals were fed a NIH-31 standard chow (Harlan Teklad, Indianapolis, IN, USA). 
Rats were singly housed in an environmentally controlled vivarium with unlimited access to water and a controlled photoperiod (12 hr. light; 12 hr. dark). Body weights and food intake were recorded biweekly. All rats were maintained between 68-72°F according to animal protocols and NIH guidelines. Total RNA was extracted from the vastus lateralis skeletal muscle using Trizol Reagent (Invitrogen, Carlsbad, CA) following the manufacturer’s instructions, n=5 from each group. Total RNA samples were biotin labeled and hybridized to RatRef-12 v1 Gene Expression beadchips (Illumina, San Diego, CA) following Illumina protocols. Arrays were washed and scanned using an Illumina BeadArray 500GX reader. Microarray fluorescent signals were extracted using the Illumina GenomeStudio Gene Expression software (v1.6.0), and any spots at or below the background were filtered out using an Illumina detection p-value of 0.02 and above. The natural log of all remaining scores was used to find the average and standard deviation of each array, and the z-score normalization was calculated. Correlation analysis, sample clustering analysis and principal component analysis including all probes were performed to identify and exclude any possible outliers. The resulting dataset was next analyzed with DIANE 6.0, a spreadsheet-based microarray analysis program. Gene set enrichment analysis uses gene expression values or gene expression change values for all of the genes in the microarray. Parametric analysis of gene set enrichment (PAGE) was used [pubmed 20682848] for gene set analysis. Gene sets from the MSIG database, the Gene Ontology database, and the GAD human disease and mouse phenotype gene sets [pubmed: 20092628] were used to explore functional level changes.
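The per-array filtering and z-score normalization described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the use of the sample (n-1) standard deviation, the guard against non-positive signals, and the function name `zscore_array` are all choices made for illustration.

```python
import math
from statistics import mean, stdev


def zscore_array(signals, detection_p, p_cutoff=0.02):
    """Z-score normalize the natural-log signals of one array.

    Spots with a detection p-value of p_cutoff or above are dropped
    first, as described in the text. Using the sample standard
    deviation is an assumption; the description does not specify it.
    """
    kept = [
        s for s, p in zip(signals, detection_p)
        if p < p_cutoff and s > 0  # s > 0 guards the logarithm
    ]
    logs = [math.log(s) for s in kept]
    avg = mean(logs)
    std = stdev(logs)
    return [(value - avg) / std for value in logs]
```

Because each array is centred and scaled by its own average and standard deviation, the resulting scores are comparable across arrays with different overall intensities.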
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Tables containing additional information on genes and cells obtained from single-cell RNA-Seq analysis of mouse testis data. SuppTableCells.xlsx contains the 10X Barcode as identifier, the replicate ID, position in the t-SNE plot, UMI and gene count per cell, the proportion of mitochondrial transcripts, cluster ID obtained by k-Means clustering (k=9), inferred cell type, and pseudotime information obtained using monocle and Scrat. SuppTableGenes contains the average expression value, fold-change compared to other cell types, and p-value relative to other cell types for each gene that was expressed in the dataset. Throughout the data, the following abbreviations for cell types are used: Spg=spermatogonia, SC=spermatocytes, RS=round spermatids, ES=elongating spermatids, CS=condensed/condensing spermatids. In cases where several clusters were identified per cell type, the earlier cluster was designated as 1.
Genotype-specific differences in expression profiles have been evaluated using human HuGene1.0-ST Gene Chips. In this dataset we include expression data obtained from 8 normal adrenal medulla and 45 PHEO/PGL patient samples. Viable-appearing tissue from the center of the lesions was collected and snap frozen for RNA extraction. Each of the 45 PHEO/PGL samples was examined by a pathologist upon resection. Samples PKh_27 and PKh_28, with SDHB mutation, were from the same patient and were taken from two different locations at different times. Diagnosis of PHEO/PGL has been confirmed in all cases histopathologically. The tissues were grouped according to genetic/syndromic background and tumor location into SDHB (n = 18), SDHD-A/T (n = 6), SDHD-HN (n = 8), and VHL (n = 13). Microarray analysis was performed on normal and tumor samples. We used the PAM model to identify the minimum subset of genes selective for each mutation class. Hierarchical cluster analysis was used to identify samples with similar expression patterns. Data were validated using qRT-PCR analysis.
Background: Even though much progress has been made in the understanding of the molecular nature of glioma, the survival rates of patients affected by this tumour have not changed significantly during these years. Thus, a deeper understanding of this malignancy is still needed in order to predict its outcome and improve patient treatment. Here, we report on VAV1, a GDP/GTP exchange factor for Rho/Rac proteins with oncogenic potential that is involved in the regulation of cytoskeletal dynamics and cell migration.
Methodology/Principal Findings: VAV1 is overexpressed in 32 patients diagnosed with high-grade glioma. Such overexpression is linked to the parallel upregulation of a number of genes coding for proteins also involved in cell invasion- and migration-related processes. Unexpectedly, immunohistochemical experiments revealed that VAV1 is not expressed in glioma cells. Instead, VAV1 is found in non-tumoral astrocyte-like cells that are located either peritumorally or perivascularly, suggesting that its expression is linked to synergistic signalling cross-talk between cancer and infiltrating cells.
Conclusions/Significance: Interestingly, we show that the pattern of expression of VAV1 is a good prognostic factor to unveil populations of high-grade glioma patients with different survival and progression-free survival rates.
1. Oligonucleotide microarray analyses
Total RNAs were extracted using the TRIzol reagent (Life Technologies, Gaithersburg, MD, USA) and purified with the RNeasy Mini kit (Qiagen, Valencia, CA, USA). The integrity of the RNA samples obtained was assessed using the 2100 Bioanalyzer (Agilent, Palo Alto, CA, USA). Double-stranded cDNAs and biotinylated cRNAs were synthesized using a T7-polyT primer and the BioArray RNA labelling Kit (Enzo, Farmingdale, NY, USA), respectively. Labelled RNAs were then fragmented and hybridised to HU-133A oligonucleotide arrays (Affymetrix, Santa Clara, CA, USA) according to standard Affymetrix protocols.
After hybridization and washes, arrays were scanned using the Gene Array Scanner (Affymetrix), and the expression value for each probe set was calculated using the MAS 5.0 software (Affymetrix). All samples had a scaling factor lower than threefold and a 3’/5’ ratio of the GAPDH probe set <2.5. Gene levels were transformed to base-two logarithms. A median normalization approach was applied. Only genes with at least three “present” calls across all samples were selected. All these steps were done at the Genomics and Proteomics Unit of the “Centro de Investigación del Cáncer, Salamanca”.
2. Microarray data analyses
To visualize clusters of genes with similar expression patterns, we used a hierarchical clustering method (Cluster and TreeView software) based on the average-linkage method with the centred correlation metric [26]. A multidimensional scaling method (BRB Arrays Tools version 3.0) was also used with Euclidean distance criteria [27]. Supervised learning was used to identify genes with statistically significant changes in expression among different classes by using the Significance Analysis of Microarrays (SAM) algorithm [28]. All data were permuted over 100 cycles by using the two-class (unpaired) and multi-class response formats. Significant genes were selected based on the lowest false discovery ratio (between 0.6 and 0.9). In addition, nonparametric tests such as the Wilcoxon rank sum test and the Kruskal-Wallis test (to compare more than two unpaired groups) were also used (SPSS 18, SPSS Inc).
3. Functional annotation of microarray data
Probe sets showing significant expression change were functionally annotated and grouped according to biological function criteria using GeneOntology biological process descriptions.
The functional analysis to identify the most relevant biological mechanisms, pathways and functional categories in gene data sets was carried out using the Ingenuity Pathway software (Ingenuity Systems, Mountain View, CA, USA) available on the web (www.ingenuity.com) [29]. A functional network was considered significant when it fulfilled the following criteria: i) to have a minimal score of 15; ii) to have a minimum of 20 direct functional interactions among the network members.
4. Quantitative reverse transcription-PCR
Total RNA was quantified on an RNA 6000 Nano Chip (Agilent Technologies) and quantitative PCR was performed using the QuantiTect SYBR Green RT–PCR kit (Qiagen). To quantify VAV1 mRNA levels, we used two different sets of probes: PAIR A (5’-AAC AAC GGG AGG TTC ACC CT-3’ and 5’-GGT CCC TCA TGG CAT CCA-3’) and PAIR B (5’-AGC CAT TGG ACC CTT TCT ACG-3’ and 5’-GCC ATG GAC ATA GGG CTT CA-3’). Amplifications were performed using the iCycler apparatus (Bio-Rad Laboratories, California, USA). Analyses of data were done using the iCycler iQ Optical System Software, version 3.0a (Bio-Rad Laboratories). Primers to GAPDH were used as intersample normalizing controls. Variations in expression of VAV1 mRNA were represented as the mean value of the fold change with respect to the VAV1 expression levels detected in sample #19209 with both pairs of oligonucleotide primers.
5. Immunohistochemical analyses
The VAV1 antibody was generated in rabbits using a synthetic peptide and purified by affinity chromatography in Bustelo’s laboratory. This antibody recognizes VAV1 proteins from humans and mice but it does not recognize other VAV family members (unpublished data). For immunostaining, tissue sections were washed thrice with Xylene and once with 100% ethanol, rehydrated by sequential changes in 80%, 70%, and 50% ethanol and a final incubation in phosphate-buffered saline (PBS). Each rehydrating step involved 3 min incubations with the indicated solutions.
Endogenous peroxidases were quenched by the addition of a 3% H2O2 solution in methanol for 30 min at room temperature (RT). Tissue sections were subsequently washed twice with PBS. Antigen retrieval was performed by incubation in 1 mM EDTA for 30 min at 37°C. The slides were washed twice in PBS and blocked in blocking buffer (Zymed, CA, USA) for 30 min at RT. Specimens were then incubated with the primary antibody (1:250 dilution) in blocking buffer. After a 1 h incubation at 37°C, slides were washed three times in PBS, incubated with a biotinylated secondary antibody for 30 min at 37°C, washed thrice in PBS, incubated with horseradish peroxidase-streptavidin for 30 min at 37°C, washed three times in PBS, and developed using the AEC substrate (Zymed). Slides were then washed twice in water, counterstained with hematoxylin (Zymed), washed again in water, and mounted with GVA (Zymed). Samples were analyzed by light microscopy and images acquired using an Axiophot imaging system (Zeiss, Munich, Germany).
6. Fluorescence in situ hybridization analyses
FISH experiments were carried out in 40 cases of glioblastoma multiforme (grade IV) positive for VAV1 expression. For this purpose, we performed dual-colour FISH analyses with locus-specific probes for centromere 7 (Abbott Molecular, Des Plaines) exactly as previously described [30]. Polysomies were defined when more than 10% of the nuclei surveyed contained three or more CEP signals (chromosome-specific FISH probes that hybridize to highly repetitive human satellite DNA sequences, usually located near centromeres).
7. Immunohistochemistry and fluorescence in situ hybridization (FISH) in paraffin-embedded tumours
Four-µm sections were cut from routinely processed paraffin blocks and mounted onto glass slides with a charged coating. Sections were dewaxed in Xylene and then rehydrated using increasing concentrations of alcohol before being rinsed briefly in water.
Slides were heated for 2 min in 1 mM EDTA (pH 9.0) in a microwavable pressure cooker. After antigen retrieval, slides were incubated for 1 h at RT in a moist chamber with a primary antibody diluted in PBS supplemented with 10% foetal calf serum. Slides were incubated for 1 h with fluorochrome-conjugated antibodies to the appropriate IgG isotypes in a moist chamber in the dark. Finally, slides were washed thrice in PBS containing 0.5% Tween 20 before FISH analysis.
8. Degenerate oligonucleotide primed-polymerase chain reaction (DOP-PCR) analyses
After the staining of tissue sections with VAV1 antibodies (see above), the regions of the tumour were identified, microdissected, and collected using the PALM® microscope system (P.A.L.M. Microlaser Technologies, Munich, Germany). The genomic DNA was extracted as indicated by Isola et al. [31] with modifications for small DNA amounts. Those included the resuspension of the microdissected sections in extraction buffer followed by a digestion with proteinase K (0.6 mg/ml). All samples were resuspended in 10 µl of 10 mM Tris-HCl (pH 7.4) and 0.1 mM EDTA. DOP-PCR amplification was performed in two steps. For the first, low-stringency step, 1 µl of sample was added to 4 µl of buffer A (2.5 µl of 600 µM dNTPs (Roche, Pleasanton, CA), 0.5 µl of 10 µM DOP primer 5’-CCGACTCGAGNNNNNNNATGTGG-3’, where N = A, C, G, or T [32], and 1 µl of 5x Sequenase Reaction Buffer (Amersham, Cleveland, OH)). Reactions were performed using 5 cycles of 30°C for 5 min, 37°C for 2 min, and 96°C for 2 min, adding 0.65 units of Sequenase in each 30°C step. The first phase product was then subjected to the second step usin
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
100 microarray and RNA-seq gene expression datasets from five model species (human, mouse, fruit fly, arabidopsis plants, and baker's yeast). These datasets represent the benchmark set that was used to test our clust clustering method and to compare it with five widely used clustering methods (MCL, k-means, hierarchical clustering, WGCNA, and self-organising maps). This data resource includes raw data files, pre-processed data files, clustering results, clustering results evaluation, and scripts.
The files are split into three zipped parts, 100Datasets_part_1.zip, 100Datasets_part_2.zip, and 100Datasets_part_3.zip. The contents of the three zipped files should be extracted to a single folder (e.g. 100Datasets).
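The extraction step can be sketched in Python with the standard library only. The archive names are those listed above; the helper name extract_parts and the default target folder are illustrative assumptions, not part of the data resource.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the description above.
PARTS = [
    "100Datasets_part_1.zip",
    "100Datasets_part_2.zip",
    "100Datasets_part_3.zip",
]


def extract_parts(archive_dir, target_dir="100Datasets", parts=PARTS):
    """Extract every archive part into one shared target folder.

    `extract_parts` is a hypothetical helper written for this illustration.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in parts:
        # Each part is unpacked into the same folder, so the contents
        # of the three archives end up side by side.
        with zipfile.ZipFile(Path(archive_dir) / name) as zf:
            zf.extractall(target)
    return target
```

Calling extract_parts(".") from the folder that holds the three downloaded archives would create 100Datasets/ next to them.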
Below is a thorough description of the files and folders in this data resource.
Scripts
The scripts used to apply each one of the clustering methods to each one of the 100 datasets and to evaluate their results are all included in the folder (scripts/).
Datasets and clustering results (folders starting with D)
The datasets are labelled as D001 to D100. Each dataset has two folders: D###/ and D###_Res/, where ### is the number of the dataset. The first folder only includes the raw dataset while the second folder includes the results of applying the clustering methods to that dataset. The files ending with _B.tsv include clustering results in the form of a partition matrix. The files ending with _E include metrics evaluating the clustering results. The files ending with _go and _go_E respectively include the enriched GO terms in the clustering results and evaluation metrics of these GO terms.
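The naming scheme above can be sketched as follows. The two function names are illustrative, the example file names used with them (such as kmeans_E) are hypothetical, and matching suffixes with any file extension removed is an assumption; the actual file names in the archive may differ.

```python
def dataset_folders(n_datasets=100):
    """Return (raw_folder, results_folder) name pairs for D001 .. D100."""
    return [(f"D{i:03d}", f"D{i:03d}_Res") for i in range(1, n_datasets + 1)]


def classify_result_file(filename):
    """Map a results file name to the kind of content described above.

    Assumption: apart from _B.tsv, suffixes are matched on the name with
    any extension removed. The longer suffix _go_E is tested before _E,
    because every name ending in _go_E also ends in _E.
    """
    if filename.endswith("_B.tsv"):
        return "partition matrix"
    stem = filename.rsplit(".", 1)[0] if "." in filename else filename
    if stem.endswith("_go_E"):
        return "GO term evaluation metrics"
    if stem.endswith("_go"):
        return "enriched GO terms"
    if stem.endswith("_E"):
        return "clustering evaluation metrics"
    return "other"
```

The order of the checks matters: testing _E first would misclassify the GO evaluation files.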
Simultaneous analysis of multiple datasets (folders starting with MD)
As our clust method is designed to extract clusters from multiple datasets simultaneously, we also tested it over multiple datasets. All folders starting with MD_ are related to "multiple datasets (MD)" results. Each MD experiment simultaneously analyses d randomly selected datasets either out of a set of 10 arabidopsis datasets or out of a set of 10 yeast datasets. For each one of the two species, all d values from 2 to 10 were tested, and at each one of these d values, 10 different runs were conducted, where at each run a different subset of d datasets is selected randomly.
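The experimental design described above (d from 2 to 10, with 10 random runs per d value) can be sketched as below. This is an illustration only and not the script used to produce the results: the function name, the dataset identifiers, and the seed are assumptions, and the sketch does not enforce that subsets differ between runs (at d = 10 only one subset exists).

```python
import random


def md_design(dataset_ids, runs_per_d=10, seed=0):
    """Return {d: [subset, ...]} with `runs_per_d` random subsets per d.

    d ranges from 2 up to the number of available datasets. The seed is
    an arbitrary choice made here for reproducibility of the sketch.
    """
    rng = random.Random(seed)
    design = {}
    for d in range(2, len(dataset_ids) + 1):
        # Draw d datasets without replacement for each run.
        design[d] = [sorted(rng.sample(dataset_ids, d)) for _ in range(runs_per_d)]
    return design


# Hypothetical identifiers for the 10 arabidopsis datasets.
arabidopsis_ids = [f"A{i:02d}" for i in range(1, 11)]
design = md_design(arabidopsis_ids)
```

With 10 datasets this yields 9 values of d and 10 runs each, i.e. 90 runs per species.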
The folders MD_10A and MD_10Y include the full sets of 10 arabidopsis or 10 yeast datasets, respectively. Each folder with the format MD_10#_d#_Res## includes the results of applying the six clustering methods in one of the 10 random runs at one of the tested d values. For example, the "MD_10A_d4_Res03/" folder includes the clustering results of the 3rd random selection of 4 arabidopsis datasets (the letter A in the folder's name refers to arabidopsis).
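A folder name of this form can be decoded with a small regular expression. The sketch below assumes the pattern is exactly MD_10, a species letter (A or Y), _d, the d value, _Res, and the run number; the function name is illustrative.

```python
import re

# Assumed pattern: MD_10<species letter>_d<d>_Res<run>, optional trailing slash.
MD_RESULT_RE = re.compile(r"^MD_10(?P<species>[AY])_d(?P<d>\d+)_Res(?P<run>\d+)/?$")
SPECIES = {"A": "arabidopsis", "Y": "yeast"}


def parse_md_folder(name):
    """Split an MD results folder name into species, d, and run number."""
    match = MD_RESULT_RE.match(name)
    if match is None:
        raise ValueError(f"not an MD results folder name: {name!r}")
    return {
        "species": SPECIES[match.group("species")],
        "d": int(match.group("d")),
        "run": int(match.group("run")),
    }


info = parse_md_folder("MD_10A_d4_Res03/")
# info == {"species": "arabidopsis", "d": 4, "run": 3}
```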
Our clust method is applied directly over multiple datasets where each dataset is in a separate data file. Each "MD_10#_d#_Res##" folder includes these individual files in a sub-folder named "Processed_Data/". However, the other clustering methods only accept a single input data file. Therefore, the datasets are merged first before being submitted to these methods. Each "MD_10#_d#_Res##" folder includes a file "X_merged.tsv" for the merged data.
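The description does not state how the individual datasets were combined into X_merged.tsv. Purely as an illustration of one plausible merge, the sketch below keeps the genes present in every dataset and concatenates their expression values in dataset order; both that rule and the in-memory representation (dictionaries keyed by gene ID) are assumptions.

```python
def merge_datasets(datasets):
    """Merge per-dataset expression tables into a single table.

    `datasets` is a list of dicts mapping gene ID -> list of expression
    values. Assumption (not stated in the source): only genes present in
    every dataset are kept, and their values are concatenated in dataset
    order.
    """
    if not datasets:
        return {}
    shared = set(datasets[0])
    for table in datasets[1:]:
        shared &= set(table)
    return {
        gene: [value for table in datasets for value in table[gene]]
        for gene in sorted(shared)
    }


# Hypothetical toy input: two datasets with partly overlapping genes.
first = {"g1": [1.0, 2.0], "g2": [3.0, 4.0]}
second = {"g2": [5.0], "g3": [6.0]}
merged = merge_datasets([first, second])
# merged == {"g2": [3.0, 4.0, 5.0]}
```

Other merge rules (for example keeping all genes and filling gaps) are equally possible; the scripts/ folder is the authoritative source for what was actually done.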
Evaluation metrics (folders starting with Metrics)
Each clustering results folder (D###_Res or MD_10#_d#_Res##) includes some clustering evaluation files ending with _E. This information is combined into tables for all datasets, and these tables appear in the folders starting with "Metrics_".
Other files and folders
The GO folder includes the reference GO term annotations for arabidopsis and yeast. The Datasets file includes a TAB-delimited table describing the 100 datasets. The SearchCriterion file includes the objective methodology of searching the NCBI database to select these 100 datasets. The Specials file includes some special considerations for a couple of datasets that differ slightly from what is described in the SearchCriterion file. The Norm### files and the files in the Reps/ folder describe normalisation codes and replicate structures for the datasets and were fed to the clust method as inputs. The Plots/ folder includes plots of the gene expression profiles of the individual genes in the clusters generated by each one of the 6 methods over each one of the 100 datasets. Only up to 14 clusters per method are plotted.