Spectral Preprocessing for Clustering Time-Series Gene Expressions
© Wentao Zhao et al. 2009
Received: 31 July 2008
Accepted: 19 January 2009
Published: 24 February 2009
Based on gene expression profiles, genes can be partitioned into clusters, which might be associated with biological processes or functions, for example, cell cycle, circadian rhythm, and so forth. This paper proposes a novel clustering preprocessing strategy which combines clustering with spectral estimation techniques so that the time information present in time-series gene expressions is fully exploited. A comparison of the clustering results with a set of biologically annotated yeast cell-cycle genes shows that the proposed clustering strategy yields clusters significantly different from those created by the traditional expression-based schemes. The proposed technique is especially helpful in grouping genes participating in time-regulated processes.
A cell is the basic unit of life, and each cell contains the instructions necessary for its proper functioning. These instructions are encoded in DNA, which is replicated and transmitted to the progeny when the cell divides. mRNAs are intermediate products in this process: they are transcribed from DNA segments (genes) and serve as the templates for protein translation. This conduit of information constitutes the central dogma of molecular biology. The fast-evolving gene microarray technology has enabled simultaneous measurement of genome-wide gene expressions in terms of mRNA concentrations. There are two types of microarray data: time series and steady state. Time-series data are obtained by sequential measurements in temporal experiments, while steady-state data are produced by recording gene expressions from independent sources, for example, different individuals, tissues, experiments, and so forth. High costs, ethical concerns, and implementation issues prevent the collection of large time-series data sets. Therefore, about 70% of the data sets are steady state, and most time-series data sets contain only a few time points, in general fewer than 20 samples.
Based on microarray measurements, clustering methods have been exploited to partition genes into subsets. Members of each subset are assumed to share a specific biological function or to participate in the same molecular-level process. They are termed coexpressed genes and are supposed to be located close to each other in the underlying genetic regulatory networks. Eisen et al. applied hierarchical clustering to partition yeast genes, Tamayo et al. exploited the self-organizing map (SOM), and Tavazoie et al. employed K-means clustering to group gene expressions and then searched for upstream DNA sequence motifs that contribute to the coexpression of genes. Besides these successful applications, Zhou et al. designed a clustering strategy that minimizes the mutual information between clusters, combining bootstrap techniques with heuristic search to solve the underlying optimization problem. Also, Giurcăneanu et al. exploited the minimum description length (MDL) principle to determine the number of clusters. Whether technically advanced schemes represent better solutions for real biological data is still under debate. However, most of the schemes provide valuable alternatives and insights to each other. Therefore, it was recommended that several clustering schemes be applied to the same real data set so that the differences between clusterings capture patterns that would otherwise be missed by running only one method.
A straightforward application of clustering schemes causes the loss of the temporal information inherent in time-series measurements. This shortcoming has been noticed in the literature. Ramoni et al. designed a model-based Bayesian method to cluster time-series data and to determine the number of clusters automatically, Tabus and Astola proposed to fit the data by linear dynamic systems, and Ernst et al. presented an algorithm designed specifically for short time series. In these models, genes in the same cluster were assumed to share a similar time-domain profile. The temporal relationships were also explored via more complex models, that is, genetic regulatory networks, which can be constructed via more computationally demanding algorithms, for example, Zhao et al. and Liang et al. However, in general, the network inference schemes deal only with relatively small-scale networks consisting of no more than a few hundred genes. Genome-wide analysis is beyond the computational capability of these inference algorithms. Therefore, clustering methods are usually exploited to partition genes, and the obtained subsets of genes serve as targets for further research aimed at recovering more accurate maps of real biological processes.
Based on time-series data, modern spectral density estimation methods have been exploited to identify periodically expressed genes. Assuming the cell cycle signal to be a single sinusoid, Spellman et al.  and Whitfield et al.  performed a Fourier transformation on the data sampled with different synchronization methods, Wichert et al.  applied the traditional periodogram and Fisher's test, while Ahdesmäki et al.  implemented a robust periodicity test procedure assuming non-Gaussian noise. The majority of these works dealt with evenly sampled data, and missing data points were usually filled by interpolation in time domain, or the genes were disregarded if there were too many vacancies.
The biological experiments generally output unequally spaced measurements. The change of sampling frequency is due to missing data and the fact that the measurements are usually event driven, that is, more observations are taken when certain biological events occur, and the measurement process is slowed down when the cell remains quiet. Therefore, an analysis based on unevenly sampled data is practically desired and technically more challenging. The harmonics exploited in discrete Fourier transform (DFT) are no longer orthogonal in the presence of uneven sampling. Lomb  and Scargle  demonstrated that a phase shift suffices to make the sine and cosine terms orthogonal again. The Lomb-Scargle scheme has been exploited in analyzing the budding yeast data set by Glynn et al. . Stoica and Sandgren  updated the traditional Capon method to cope with the irregularly sampled data. Notice also that Wang et al.  designed the missing-data amplitude and phase estimation (MAPES) approach, which estimated the missing data and spectrum iteratively through the usage of the Expectation Maximization (EM) algorithm. Although Capon and MAPES methods aim to achieve a better spectral resolution than Lomb-Scargle periodogram, for small sample size, the simpler Lomb-Scargle periodogram appears to possess higher accuracy in the presence of real biological data sets .
This paper proposes a novel clustering preprocessing procedure which combines the power spectral density analysis with clustering schemes. Given a set of microarray measurements, the power spectral density of each gene is first computed, then the spectral information is fed into the clustering schemes. The members within the same cluster will share similar spectral information, therefore they are supposed to participate in the same temporally regulated biological process. The assumptions underlying this statement rely on the following facts: if two genes X and Y are in the same cluster, their spectral densities are very close to each other; in the time domain, their gene expressions may just differ in their phases. The phases are usually modeled to correspond to different stages of the same biological processes, for example, cell cycle or circadian rhythms. The proposed spectral-density-based clustering actually differentiates the following two cases.
Gene X's expression and Gene Y's expression are uncorrelated in both time and frequency domains.
Gene X and Y expressions are uncorrelated in time domain, but gene X's expression is a time-shifted version of gene Y's expression.
In the traditional clustering schemes, the distances are the same for the above two cases (both assuming large values). However, in the proposed algorithm, the second case is favorable and presents a lower distance. Therefore, by exploiting the proposed algorithm, the genes participating in the same biological process are more likely to be grouped into the same cluster. Lomb-Scargle periodogram serves as the spectral density estimation tool since it is computationally simple and possesses higher accuracy in the presence of unevenly measured and small-size gene expression data sets. The appropriate clustering method is determined based on intense computer simulations. Three major clustering methods: hierarchical, K-means, and self-organizing map (SOM) schemes are tested with different configurations. The spectra and expression-based clusterings are compared with respect to their ability of grouping cell-cycle genes that have been experimentally verified. The differences between clusterings are recorded and compared in terms of information theoretic quantities.
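The distinction between the two cases can be illustrated with a small numerical sketch. It is not the authors' code: for simplicity it assumes evenly sampled, noise-free sinusoids and uses a plain DFT power spectrum in place of the Lomb-Scargle periodogram, and all names (`power_spectrum`, `gene_x`, `gene_y`, `gene_z`) are hypothetical.

```python
import math

def power_spectrum(x):
    """Squared DFT magnitude (bins 0..n/2) of an evenly sampled, mean-removed series."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(c * math.cos(2 * math.pi * k * j / n) for j, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * k * j / n) for j, c in enumerate(centered))
        spectrum.append((re * re + im * im) / n)
    return spectrum

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

n = 24  # number of time points (assumed)
gene_x = [math.sin(2 * math.pi * 3 * j / n) for j in range(n)]
# Case 2: gene_y is gene_x delayed by a quarter period (2 samples of an 8-sample period).
gene_y = [math.sin(2 * math.pi * 3 * (j - 2) / n) for j in range(n)]
# Case 1: gene_z oscillates at an unrelated frequency.
gene_z = [math.sin(2 * math.pi * 5 * j / n) for j in range(n)]

time_corr_xy = pearson(gene_x, gene_y)   # close to 0: uncorrelated in time
time_corr_xz = pearson(gene_x, gene_z)   # close to 0: uncorrelated in time
spec_dist_xy = euclidean(power_spectrum(gene_x), power_spectrum(gene_y))  # close to 0
spec_dist_xz = euclidean(power_spectrum(gene_x), power_spectrum(gene_z))  # clearly nonzero
```

Both pairs look equally unrelated in the time domain, but only the time-shifted pair has a negligible distance between power spectra, which is the property the spectral preprocessing relies on.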
This section explains how to apply the Lomb-Scargle periodogram to time-series gene expressions. The three clustering schemes (hierarchical, K-means, and self-organizing map (SOM)) are then briefly formulated. Afterward, we discuss how to validate the clusterings and how to compare them. The notational convention is as follows: matrices and vectors are in bold face, and scalars are in regular font.
2.1. Lomb-Scargle Periodogram
Most spectral analysis methods, for example, the Fourier transform and the traditional periodogram employed in Spellman et al. and Wichert et al., rely on evenly sampled data, which are projected onto orthogonal sine and cosine harmonics. However, real microarray measurements are not evenly sampled due to missing data points and changing sampling frequency. Uneven sampling destroys the orthogonality of these projections. Lomb found that a phase shift of the sine and cosine functions restores the orthogonality among harmonics. Scargle complemented Lomb's periodogram by characterizing its statistical distribution. Since then, the established Lomb-Scargle periodogram has been exploited in numerous fields and applications, including bioinformatics and genomics (see, e.g., Glynn et al.).
Notice further that the spectra at the front and rear halves of the frequency grid are symmetric since the microarray experiments output real values.
Lomb-Scargle periodogram represents an efficient solution in estimating the spectra of unevenly sampled data sets. Simulation results also verify its superior performance for biological data with small sample size and various unevenly sampled patterns .
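As an illustration, the sketch below evaluates the commonly stated normalized form of the Lomb-Scargle periodogram in plain Python. It is a minimal sketch rather than the authors' implementation: the function name `lomb_scargle`, the frequency grid, the normalization by the sample variance, and the synthetic test signal are all assumptions made here.

```python
import math

def lomb_scargle(times, values, freqs):
    """Normalized Lomb-Scargle power at each frequency (cycles per time unit)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    powers = []
    for f in freqs:
        w = 2 * math.pi * f
        # Phase shift tau that makes the sine and cosine terms orthogonal
        # on the (possibly uneven) sampling times.
        s2 = sum(math.sin(2 * w * t) for t in times)
        c2 = sum(math.cos(2 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2 * w)
        cos_terms = [math.cos(w * (t - tau)) for t in times]
        sin_terms = [math.sin(w * (t - tau)) for t in times]
        xc = sum((v - mean) * c for v, c in zip(values, cos_terms))
        xs = sum((v - mean) * s for v, s in zip(values, sin_terms))
        cc = sum(c * c for c in cos_terms)
        ss = sum(s * s for s in sin_terms)  # assumed nonzero for the grids used here
        powers.append((xc * xc / cc + xs * xs / ss) / (2 * var))
    return powers

# Synthetic example: 40 unevenly spaced samples of a sinusoid at 0.1 cycles per time unit.
times = [j + 0.3 * math.sin(j) for j in range(40)]
values = [math.cos(2 * math.pi * 0.1 * t + 0.7) for t in times]
freqs = [0.05 * k for k in range(1, 10)]  # 0.05, 0.10, ..., 0.45
powers = lomb_scargle(times, values, freqs)
peak_index = powers.index(max(powers))
peak_freq = freqs[peak_index]
```

The vector `powers`, computed per gene on a common frequency grid, is the kind of spectral profile that would replace the raw expression vector as clustering input.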
The obtained Lomb-Scargle power spectral density will be used as input to clustering schemes as an alternative to the original gene expression measurements. Three clustering schemes, hierarchical, K-means, and self-organizing map (SOM), are used for testing this substitution.
2.2.1. Hierarchical Clustering
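Table 1: Distance metrics between two genes' measurement vectors (Euclidean, city-block, cosine, and correlation). The definitions involve the matrix transpose, the sample size together with an index over individual samples, and the means of the two vectors.
Table 2: Distance metrics between two clusters (single, complete, and average linkage). The gene-to-gene distance is the one defined in Table 1, and the definitions also use the size of each cluster.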
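For reference, the conventional forms of the four gene-level distances (Euclidean, city-block, cosine, correlation) and of the three linkage rules (single, complete, average) can be written as follows. The symbols $\mathbf{x}$, $\mathbf{y}$, $N$, $n$, $\bar{x}$, $\bar{y}$, $\mathbf{1}$ (the all-ones vector), $C_i$, $C_j$, $d$, and $D$ are assumed here for illustration and may differ in notation or normalization from Tables 1 and 2.

```latex
\begin{align*}
d_{\mathrm{Euclidean}}(\mathbf{x},\mathbf{y}) &= \sqrt{(\mathbf{x}-\mathbf{y})^{T}(\mathbf{x}-\mathbf{y})}, \\
d_{\mathrm{city\text{-}block}}(\mathbf{x},\mathbf{y}) &= \sum_{n=1}^{N} \lvert x_n - y_n \rvert, \\
d_{\mathrm{cosine}}(\mathbf{x},\mathbf{y}) &= 1 - \frac{\mathbf{x}^{T}\mathbf{y}}{\sqrt{\mathbf{x}^{T}\mathbf{x}}\,\sqrt{\mathbf{y}^{T}\mathbf{y}}}, \\
d_{\mathrm{correlation}}(\mathbf{x},\mathbf{y}) &= 1 - \frac{(\mathbf{x}-\bar{x}\mathbf{1})^{T}(\mathbf{y}-\bar{y}\mathbf{1})}{\sqrt{(\mathbf{x}-\bar{x}\mathbf{1})^{T}(\mathbf{x}-\bar{x}\mathbf{1})}\,\sqrt{(\mathbf{y}-\bar{y}\mathbf{1})^{T}(\mathbf{y}-\bar{y}\mathbf{1})}}, \\[4pt]
D_{\mathrm{single}}(C_i,C_j) &= \min_{\mathbf{x}\in C_i,\ \mathbf{y}\in C_j} d(\mathbf{x},\mathbf{y}), \\
D_{\mathrm{complete}}(C_i,C_j) &= \max_{\mathbf{x}\in C_i,\ \mathbf{y}\in C_j} d(\mathbf{x},\mathbf{y}), \\
D_{\mathrm{average}}(C_i,C_j) &= \frac{1}{\lvert C_i\rvert\,\lvert C_j\rvert}\sum_{\mathbf{x}\in C_i}\sum_{\mathbf{y}\in C_j} d(\mathbf{x},\mathbf{y}).
\end{align*}
```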
The single linkage method actually constructs a minimal spanning tree, and it sometimes builds an undesirably long chain. The complete linkage method discourages the chaining effect and in each step increases the cluster diameter as little as possible. However, it assumes that the true clusters are compact. Alternatively, the average linkage method makes a compromise and is usually the preferred method since it imposes no assumption on the structure of clusters. The selection of distance metric and linkage method depends on the nature of the real data, and it has been proposed to test several clustering schemes at the same time so that each can capture different aspects of the data. The hierarchical clustering scheme can be formulated in terms of the pseudocode depicted in Algorithm 1. If a specific number of clusters is desired, only the stopping condition in line 3 needs to be changed so that the loop ends once that number of clusters remains.
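Algorithm 1: Hierarchical clustering algorithm.
1: Input all genes with their expressions or spectral densities;
2: Initialize the set of clusters with one singleton cluster per gene;
3: while more than one cluster remains do
4: Find the two clusters separated by the smallest distance under the chosen linkage method;
5: Insert their union as a new cluster, and delete the two merged clusters;
6: Label all existing clusters with integers;
8: end while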
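The agglomerative procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes the Euclidean distance, recomputes all linkage distances in every step (simple but slow for thousands of genes), and the names `hierarchical`, `linkage_distance`, and `profiles` are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(ci, cj, data, method):
    """Distance between two clusters, each given as a list of gene indices."""
    pairwise = [euclidean(data[i], data[j]) for i in ci for j in cj]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # average linkage

def hierarchical(data, n_clusters=1, method="average"):
    clusters = [[i] for i in range(len(data))]  # one singleton cluster per gene
    while len(clusters) > n_clusters:           # stopping condition (line 3 of the pseudocode)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], data, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        # Delete the two closest clusters and insert their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters

# Toy input: six two-dimensional profiles forming two well-separated groups.
profiles = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0], [5.0, 5.3]]
two_clusters = hierarchical(profiles, n_clusters=2, method="complete")
```

Passing `n_clusters` greater than 1 corresponds to changing the loop condition so that merging stops once the desired number of clusters remains.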
2.2.2. K-Means Clustering
The K-means clustering divides the genes into a predetermined number of clusters. It iteratively updates the centroid of each cluster and reassigns each gene to the cluster with the nearest centroid. Different distance metrics, as listed in Table 1, can also be exploited in the K-means clustering scheme. In each iteration, the new centroid might be the median or mean of the cluster members. The K-means clustering can be formulated as Algorithm 2. One of the problems associated with K-means clustering is that the iterations may converge to a locally optimal but globally suboptimal solution. Therefore, in our simulation we ran the algorithm 5 times and reported the run with the best performance. The K-means clustering method was exploited by Tavazoie et al., who combined the clustering with the motif finding problem.
Algorithm 2: K-means clustering algorithm.
1: Input gene expressions or spectral densities, and the desired number of clusters;
2: Randomly create one centroid per cluster;
3: Assign each gene to the cluster with the nearest centroid;
4: while members in some clusters change do
5: compute the centroid of each cluster;
6: assign each gene to the cluster with the nearest centroid;
7: end while
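A compact Python sketch of the iteration, including the repeated random restarts mentioned in the text, is given below. Several choices are assumptions made here and not taken from the paper: initial centroids are drawn from the data points, the Euclidean distance and the mean are used, and the "best" run is taken to be the one with the smallest within-cluster sum of squares. The names `kmeans` and `best_of` are hypothetical.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(data, k, rng):
    centroids = rng.sample(data, k)  # assumed initialization: k distinct data points
    labels = None
    while True:
        # Assign each gene to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(x, centroids[c])) for x in data]
        if new_labels == labels:     # no membership changed: converged
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, l in zip(data, labels) if l == c]
            if members:              # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(euclidean(x, centroids[l]) ** 2 for x, l in zip(data, labels))
    return labels, cost

def best_of(data, k, restarts=5, seed=0):
    """Run K-means several times and keep the run with the lowest cost (assumed criterion)."""
    rng = random.Random(seed)
    return min((kmeans(data, k, rng) for _ in range(restarts)), key=lambda run: run[1])

profiles = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0], [5.0, 5.3]]
labels, cost = best_of(profiles, k=2)
```

Other selection criteria for the best run, such as the mutual information with an annotated gene set, could be substituted for the cost used in this sketch.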
2.2.3. Self-Organizing Map (SOM) Clustering
In the centroid update (6), the neighborhood function defines the distance between two nodes of the two-dimensional lattice, identified by their indices. It can be set to 1 if one node is within the neighborhood of the other, and 0 otherwise. The learning rate function decreases monotonically as either of its arguments increases. The SOM clustering algorithm can be formulated as Algorithm 3.
Algorithm 3: SOM clustering algorithm.
1: Input gene expressions or spectral densities, the desired number of clusters, and the maximum number of iterations;
2: Randomly create one centroid per cluster;
3: Assign each gene to the cluster with the nearest centroid;
4: for each iteration up to the maximum number of iterations do
5: Randomly select a gene expression;
6: Find the node whose centroid is nearest to the selected gene;
7: Update centroids based on (6);
8: end for
9: Assign each gene to the cluster with the nearest centroid;
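The following Python sketch shows one common way to realize such a SOM. It is a sketch under stated assumptions, not the update rule (6) itself: the update `m <- m + rate * h * (x - m)`, the 0/1 neighborhood based on city-block lattice distance, the linearly decaying learning rate, and every name and default parameter (`som`, `radius`, `rate0`) are choices made here for illustration.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: euclidean(x, centroids[k]))

def som(data, rows, cols, max_iter, seed=0, radius=1, rate0=0.5):
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]  # two-dimensional lattice
    centroids = [list(rng.choice(data)) for _ in nodes]         # random initial centroids
    for t in range(max_iter):
        x = rng.choice(data)                   # randomly select a gene profile
        wr, wc = nodes[nearest(x, centroids)]  # lattice position of the winning node
        rate = rate0 * (1 - t / max_iter)      # learning rate decreasing with the iteration
        for k, (r, c) in enumerate(nodes):
            # Neighborhood function: 1 inside the lattice neighborhood, 0 outside.
            if abs(r - wr) + abs(c - wc) <= radius:
                centroids[k] = [m + rate * (xi - m) for m, xi in zip(centroids[k], x)]
    return [nearest(x, centroids) for x in data]  # final cluster assignment

profiles = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0], [5.0, 5.3]]
som_labels = som(profiles, rows=2, cols=2, max_iter=200)
```

Each lattice node acts as one cluster, so a `rows` by `cols` lattice yields at most `rows * cols` clusters.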
2.3. Performance Evaluation Metric
The three clustering schemes with inputs of either gene expressions or spectral densities are to be evaluated in two different ways: how they group time-regulated genes, and whether they are significantly different from each other. Different criteria are defined based on information theoretic quantities.
2.3.1. Validation of Clustering Scheme
In the entropy of a clustering, the cluster-size operator counts the genes in a cluster. Genes cooperate by participating in the same biological processes; in other words, singleton clusters are not expected to occur frequently in the clustering. Therefore, for a given number of clusters, the sizes of the clusters should be balanced, and the higher the entropy of the clustering, the better the clustering scheme.
It is desirable that genes with the same functions be integrated in as small number of clusters as possible. Therefore, the smaller the joint entropy, the better the clustering.
The mutual information between the clustering and the prespecified gene set also involves the entropy of that set, which is defined similarly as in (7) and is constant across different clustering schemes. This metric is actually consistent with that proposed in Gibbons and Roth, whereby multiple gene attributes were considered. Higher mutual information between the clustering and the prespecified set indicates a balanced clustering of all genes in which the genes of the prespecified set are more concentrated; in other words, it indicates better performance.
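The three quantities can be illustrated with a short Python sketch. It assumes one plausible reading of the definitions: entropies are computed from relative cluster sizes, the prespecified set is encoded as a 0/1 membership label per gene, the joint entropy is taken over (cluster, membership) pairs, and the mutual information is the sum of the two marginal entropies minus the joint entropy. The base-2 logarithm and all variable names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a labeling, computed from relative label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Eight genes; the first three belong to the prespecified (annotated) set.
in_set = [1, 1, 1, 0, 0, 0, 0, 0]
concentrated = [0, 0, 0, 0, 1, 1, 1, 1]   # set genes fall into a single cluster
scattered = [0, 1, 0, 1, 0, 1, 0, 1]      # set genes are spread over both clusters
unbalanced = [0, 0, 0, 0, 0, 0, 0, 1]     # one large cluster plus a singleton

mi_concentrated = mutual_information(concentrated, in_set)
mi_scattered = mutual_information(scattered, in_set)
h_balanced = entropy(concentrated)
h_unbalanced = entropy(unbalanced)
```

In this toy case the balanced clustering has the higher entropy, and the clustering that concentrates the annotated genes has the higher mutual information, in line with the criteria described above.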
2.3.2. Difference between Two Clusterings
Two clustering schemes create two different partitions of all the observed genes. A measure of the distance between two clusterings is highly valuable when the two schemes do not show a significant difference in their performance. Various metrics have been proposed to evaluate the difference between two clusterings, for example, Fowlkes and Mallows, Rand, and more recently Meilă. We adopt Meilă's variation of information (VI) metric because it is more discriminative, makes no assumption on the clustering structure, requires no rescaling, and does not depend on the sample size.
VI is bounded from above, and it is zero if and only if the two clusterings are exactly the same. The greater the variation of information, the larger the difference between the two clusterings.
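A minimal Python sketch of the variation of information follows, using Meilă's definition VI = H(C) + H(C') - 2 I(C; C'), which equals 2 H(C, C') - H(C) - H(C'). The base-2 logarithm and the function names are choices made here for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a labeling, computed from relative label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def variation_of_information(a, b):
    """VI(a, b) = 2 H(a, b) - H(a) - H(b) for two labelings of the same genes."""
    joint = entropy(list(zip(a, b)))
    return 2 * joint - entropy(a) - entropy(b)

clustering_1 = [0, 0, 1, 1]
relabeled_1 = [1, 1, 0, 0]    # same partition with the cluster labels swapped
clustering_2 = [0, 1, 0, 1]   # a partition that cuts across clustering_1

vi_same = variation_of_information(clustering_1, relabeled_1)
vi_diff = variation_of_information(clustering_1, clustering_2)
```

Because VI depends only on the partitions, relabeling the clusters leaves it at zero, whereas the two crossing partitions above are maximally different for four genes.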
The performance of the proposed power spectrum-based scheme is illustrated through comparisons with three traditional expression-based clustering schemes: hierarchical, K-means, and self-organizing map (SOM). The comparisons are divided into two parts. In the first part, we evaluate their ability to group the genes involved in the cell cycle, while the second part is devoted to illustrating that the proposed schemes construct clusters that are significantly different from those created by the traditional schemes.
3.1. Clustering Performance Evaluation
These simulations were performed on the cdc15 data set published by Spellman et al. , which contained 24 time-series expression measurements of 6178 yeast genes. The hierarchical, K-means, and self-organizing map (SOM) clustering schemes were simulated having as inputs the computed spectral densities and the original expression data. The hierarchical and K-means clustering were configured with different distance and linkage methods, which are defined in Tables 1 and 2, respectively. The simulations were executed until up to 200 clusters were created.
The cell cycle has served as a research target in molecular biology for a long time since it plays a crucial role in cell division, and medically it underlies the development of cancer. Experimentally, 109 genes have been verified to participate in the cell-cycle process, and their interactions are recorded in the public database KEGG. Among them, 104 genes were reported in Spellman's data set. The simulations tested how these genes were clustered with other genes. Intuitively, the more tightly these 104 genes are grouped, the better the clustering scheme. On the other hand, it is desirable that the cluster sizes be relatively balanced, and there should not be many singleton clusters (clusters containing only one gene).
The clustering performance is represented by an information theoretic quantity, that is, mutual information, which is defined between the obtained partition of all measured genes and the set of 104 genes. Higher mutual information indicates that the 104 cell-cycle genes are closely integrated into only a few clusters, and most clusters are balanced in size. In other words, with the same number of clusters, the higher the mutual information, the better the performance.
The proposed strategy is certainly not limited to detecting cell-cycle genes. However, we confine our discussion to the cell cycle here because the available data set is well suited to cell-cycle research. Besides, the cell-cycle genes have been identified for a relatively long time with high confidence.
Figure 1(b) shows the results for the complete linkage method of the hierarchical clustering. Each cluster actually represents a complete subgraph. The complete linkage method discourages the chaining effect that occurs in the single linkage method. The performance of the spectral density-based clusterings is lower bounded by the worst performance of the traditional gene expression-based clusterings. For the gene expression-based clustering, the correlation and cosine approaches are better than the Euclidean and city-block approaches, while for the spectral density clustering, the Euclidean and city-block approaches exhibit the best performance.
Figure 1(c) plots the results for the average linkage method of the hierarchical clustering. The average linkage is the most widely deployed method since it makes a compromise between the single and the complete methods, and it does not assume any structure on the underlying data. However, in the presence of real gene expression data, it is not as good as the complete linkage method. Different distance metrics differ in terms of their ability to group the involved cell-cycle genes. For clustering expression data, the cosine and correlation approaches still achieve the best performance, but they exhibit poorer performance than the spectra-based Euclidean and city-block methods.
The inferior performance of correlation and cosine metrics with spectra input is partially due to the flat spectra for those genes with no time-regulated patterns. The flat spectrum in the denominator will cause the distance metrics to be highly biased. It is also worthwhile to note that in literature other distance metrics have been proposed, for example, coherence  and mutual information . However, these metrics involve the estimation of joint distribution, which usually requires large sample sizes. Such a requirement cannot be satisfied in general by the microarray experiments. Extra normalization of the spectrum can be performed, but simulation shows that it does not provide a significant or consistent improvement.
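The effect of a nearly flat spectrum on the correlation metric can be seen in a small numerical sketch. The two vectors below are hypothetical, nearly flat "spectra" that differ only by tiny fluctuations, and the conventional correlation distance (one minus the Pearson correlation) is assumed.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """One minus the Pearson correlation; the denominator shrinks for flat inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1 - num / den

# Two almost identical, almost flat spectra (hypothetical values).
flat_a = [1.0, 1.001, 1.0, 0.999]
flat_b = [1.0, 0.999, 1.0, 1.001]

d_euclid = euclidean(flat_a, flat_b)             # tiny: the spectra nearly coincide
d_corr = correlation_distance(flat_a, flat_b)    # near the maximum of 2
```

After mean removal only the small fluctuations remain, and dividing by their tiny norm amplifies them, so two practically identical flat spectra can appear maximally distant under the correlation metric while remaining close under the Euclidean metric.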
3.2. Distance between Clusterings
A test of the distance between the spectra-based and gene expression-based clusterings also reveals the value of the proposed scheme. The variation of information metric, proposed by Meilă, is exploited to measure the difference between the two clusterings. The basic principle is that the higher the variation of information, the greater the difference.
A novel clustering preprocessing strategy is proposed to combine the traditional clustering schemes with power spectral analysis of time-series gene expression measurements. The simulation results corroborate that the proposed approach achieves a better clustering for hierarchical, K-means, and self-organizing map (SOM) schemes in most cases. In addition, it constructs a significantly different partition relative to traditional clustering strategies. When deploying the hierarchical or K-means clustering methods based on the spectral density, the Euclidean and city-block distance metrics appear to be more appealing than the cosine or correlation distance metrics. The proposed algorithm is valuable since it provides additional information about temporally regulated genetic processes, for example, the cell cycle.
This work was supported by the National Cancer Institute (CA-90301) and the National Science Foundation (ECS-0355227 and CCF-0514644).
- Simon I, Siegfried Z, Ernst J, Bar-Joseph Z: Combined static and dynamic analysis for determining the quality of time-series expression profiles. Nature Biotechnology 2005, 23(12):1503-1508. 10.1038/nbt1164
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(25):14863-14868. 10.1073/pnas.95.25.14863
- Tamayo P, Slonim D, Mesirov J, et al.: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(6):2907-2912. 10.1073/pnas.96.6.2907
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics 1999, 22(3):281-285. 10.1038/10343
- Zhou X, Wang X, Dougherty ER, Russ D, Suh E: Gene clustering based on clusterwide mutual information. Journal of Computational Biology 2004, 11(1):147-161. 10.1089/106652704773416939
- Giurcăneanu CD, Tăbuş I, Astola J, Ollila J, Vihinen M: Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure. Journal of Computational Biology 2004, 11(4):660-682.
- D'Haeseleer P: How does gene expression clustering work? Nature Biotechnology 2005, 23(12):1499-1501. 10.1038/nbt1205-1499
- Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(14):9121-9126. 10.1073/pnas.132656399
- Tabus I, Astola J: Clustering the non-uniformly sampled time series of gene expression data. Proceedings of the International Symposium on Signal Processing and Applications (ISSPA '03), Paris, France, July 2003 2: 61-64.
- Ernst J, Nau GJ, Bar-Joseph Z: Clustering short time series gene expression data. Bioinformatics 2005, 21(supplement 1):i159-i168.
- Zhao W, Serpedin E, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22(17):2129-2135. 10.1093/bioinformatics/btl364
- Liang S, Fuhrman S, Somogyi R: Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Proceedings of the Pacific Symposium on Biocomputing, Maui, Hawaii, USA, January 1998 3: 18-29.
- Spellman PT, Sherlock G, Zhang MQ, et al.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):3273-3297.
- Whitfield ML, Sherlock G, Saldanha AJ, et al.: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 2002, 13(6):1977-2000. 10.1091/mbc.02-02-0030.
- Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20(1):5-20. 10.1093/bioinformatics/btg364
- Ahdesmäki M, Lähdesmäki H, Pearson R, Huttunen H, Yli-Harja O: Robust detection of periodic time series measured from biological systems. BMC Bioinformatics 2005, 6, article 117: 1-18.
- Lomb NR: Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science 1976, 39(2):447-462. 10.1007/BF00648343
- Scargle JD: Studies in astronomical time series analysis—II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 1982, 263(99):835-853.
- Glynn EF, Chen J, Mushegian AR: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 2006, 22(3):310-316. 10.1093/bioinformatics/bti789
- Stoica P, Sandgren N: Spectral analysis of irregularly-sampled data: paralleling the regularly-sampled data approaches. Digital Signal Processing 2006, 16(6):712-734. 10.1016/j.dsp.2006.08.012
- Wang Y, Stoica P, Li J, Marzetta TL: Nonparametric spectral analysis with missing data via the EM algorithm. Digital Signal Processing 2005, 15(2):191-206. 10.1016/j.dsp.2004.10.004
- Zhao W, Agyepong K, Serpedin E, Dougherty ER: Detecting periodic genes from irregularly sampled gene expressions: a comparison study. EURASIP Journal on Bioinformatics and Systems Biology 2008, 2008:-8.
- Eyer L, Bartholdi P: Variable stars: which Nyquist frequency? Astronomy and Astrophysics 1999, 135(1):1-3.
- KEGG Yeast Cell Cycle Pathway http://www.genome.ad.jp/kegg/pathway/sce/sce04111.html
- Gibbons FD, Roth FP: Judging the quality of gene expression-based clustering methods using gene annotation. Genome Research 2002, 12(10):1574-1581. 10.1101/gr.397002
- Fowlkes E, Mallows C: A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 1983, 78(383):553-569. 10.2307/2288117
- Rand WM: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971, 66(336):846-850. 10.2307/2284239
- Meilă M: Comparing clusterings—an information based distance. Journal of Multivariate Analysis 2007, 98(5):873-895. 10.1016/j.jmva.2006.11.013
- Butte AJ, Bao L, Reis BY, Watkins TW, Kohane IS: Comparing the similarity of time-series gene expression using signal processing metrics. Journal of Biomedical Informatics 2001, 34(6):396-405. 10.1006/jbin.2002.1037
- Brillinger DR: Second-order moments and mutual information in the analysis of time series. In Recent Advances in Statistical Methods. Imperial College Press, London, UK; 2002:64-76.
- Supplementary Materials http://www.ece.tamu.edu/~wtzhao/EurasipBSBClutering.htm
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.