Clustering of Gene Expression Data Based on Shape Similarity
 Travis J Hestilow^{1} and
 Yufei Huang^{1, 2}
DOI: 10.1155/2009/195712
© T. J. Hestilow and Y. Huang. 2009
Received: 4 August 2008
Accepted: 27 January 2009
Published: 4 March 2009
Abstract
A method for gene clustering from expression profiles using shape information is presented. Conventional clustering approaches such as K-means assume that genes with similar functions have similar expression levels and hence allocate genes with similar expression levels to the same cluster. However, genes with similar function often exhibit similarity in signal shape even though their expression magnitudes can be far apart. Therefore, this investigation studies clustering according to signal shape similarity. The shape information is captured in the form of normalized and time-scaled forward first differences, which are then subjected to variational Bayes clustering plus a non-Bayesian (Silhouette) cluster statistic. The statistic shows an improved ability to identify the correct number of clusters and to assign the members of each cluster. Based on initial results for both generated test data and Escherichia coli microarray expression data, and on initial validation of the Escherichia coli results, it is shown that the method has promise in being able to better cluster time-series microarray data according to shape similarity.
1. Introduction
Investigating the genetic structure and metabolic functions of organisms is an important yet demanding task. Genetic actions and interactions, and how genes control and are controlled, are determined and/or inferred from data from many sources. One of these sources is time-series microarray data, which measure the dynamic expression of genes across an entire organism. Many methods of analyzing this data have been presented and used. One popular method, especially for time-series data, is gene-based profile clustering [1]. This method groups genes with similar expression profiles in order to find genes with similar functions or to relate genes with dissimilar functions across different pathways occurring simultaneously.
There has been much work on clustering time-series data, and clustering can be based on either similarity of expression magnitude or the shape of expression dynamics. Clustering methods include hierarchical and partitional types (such as K-means, fuzzy K-means, and mixture modeling) [2]. Each method has its strengths and weaknesses. Hierarchical techniques do not produce clusters per se; rather, they produce trees or dendrograms. Clusters can be built from these structures by later cutting the output structure at various levels. Hierarchical techniques can be computationally expensive, require relatively smooth data, and/or be unable to "recover" from a poor guess; that is, the method is unable to reverse itself and recalculate from a prior clustering set. They also often require manual intervention in order to properly delineate the clusters. Finally, the clusters themselves must be well defined. Noisy data that produce ill-defined boundaries between clusters usually lead to a poor cluster set.
Partitional clustering techniques strive to group data vectors (in this case, gene expression profiles) into clusters such that the data in a particular cluster are more similar to each other than to data in other clusters. Partitional clustering can be done on the data itself or on spline representations of the data [3, 4]. In either case, square-error techniques such as K-means are often used. K-means is computationally efficient and converges quickly, although only to a local minimum of the within-cluster variance that depends on its starting point. It must also know the number of clusters in advance; there is no provision for determining an unknown number of clusters other than repeatedly running the algorithm with different cluster numbers, which for large datasets can be very time consuming. Further, as is the case with hierarchical methods, K-means is best suited for clusters which are compact and well separated; it performs poorly with overlapping clusters. Finally, it is sensitive to noise and has no provision for accounting for such noise through a probabilistic model or the like. A related technique, fuzzy K-means, attempts to mimic the idea of posterior cluster membership probability through a concept of "degree of membership." However, this method is not computationally efficient and requires at least an a priori estimate of the degree of membership for each data point. Also, the number of clusters must be supplied a priori, or a separate algorithm must be used in order to determine the optimum number of clusters. Another similar method is agglomerative clustering [5]. Model-based techniques go beyond fuzzy K-means and actually attempt to model the underlying distributions of the data. These methods maximize the likelihood of the data given the proposed model [4, 6].
More recently, much study has been given toward clustering based on expression profile shape (or trajectory) rather than absolute levels. Kim et al. [7] show that genes with similar function often exhibit similarity in signal shape even though the expression magnitude can be far apart. Therefore, expression shape is a more important indication of similar gene functions than expression magnitude.
The same clustering methods mentioned above can be used based on shape similarity. An excellent example of a tree-based algorithm using shape similarity as a criterion can be found in [8]. While the results of that investigation proved fruitful, it should be noted that the data used in the study resulted in well-defined clusters. Further, the clustering was done manually once the dendrogram was created. Möller-Levet et al. [9] used fuzzy K-means to cluster time-series microarray data using shape similarity as a criterion. However, the number of clusters was known beforehand; no separate optimization method was used in order to find the proper number of clusters. Balasubramaniyan et al. [10] used a similarity measure over time-shifted profiles to find local (short time scale) similarities. Phang et al. [11] used a simple shape decomposition and a nonparametric Kruskal-Wallis test to group the trajectories. Finally, Tjaden [12] used a K-means-related method with error information included intrinsically in the algorithm.
A common difficulty with these approaches is determining the optimal number of clusters. There have been numerous studies and surveys over the years aimed at finding optimal methods for unsupervised clustering of data; for example, [13–20]. Different methods achieve different results, and no single method appears to be optimal in a global sense. The problem is essentially a model selection problem. It is well known that Bayesian methods provide the optimal framework for selecting models, though a complete treatment is analytically intractable for most cases. In this paper, a Bayesian approach based on the Variational Bayes Expectation Maximization (VBEM) algorithm is proposed to determine the number of clusters, and better performance than the MDL and BIC criteria is demonstrated.
In this study, the goal was to find clusters of genes with similar functions, that is, coregulated genes, using time-series microarray data. As a result, we chose to cluster genes based on signal shape information. In particular, signal shape information is derived from the normalized, time-scaled forward first differences of the time-sequence data. This information is then forwarded to a Variational Bayes Expectation Maximization algorithm (VBEM, [21]), which performs the clustering. Unlike K-means, VBEM is a probabilistic method, which was derived within the Bayesian statistical framework and has been shown to provide better performance. Further, when paired with an external clustering statistic such as the Silhouette statistic [22], the VBEM algorithm can also determine the optimal number of clusters.
The rest of the paper is organized as follows. In Section 2 the problem is discussed in more detail, the underlying model is developed, and the algorithm is presented. In Section 3 the results of our evaluation of the algorithm against both simulated and real time-series data are shown. Also presented are comparisons between the algorithm and K-means clustering, both methods using several different criteria for making clustering decisions. Conclusions are summarized in Section 4. Finally, Appendices A, B, and C present a more detailed derivation of the algorithm.
2. Method
2.1. Problem Statement and Method
Given microarray time-series data for a set of genes, in which each gene has one expression value per time point (that is, per column of the microarray), it is desired to cluster the gene expressions based on signal shape. The clustering is not known a priori; therefore, not only must individual genes be assigned to relevant clusters, but the number of clusters must also be determined.
The clustering is based on expression-level shape rather than magnitude. The shape information is captured by the first-order time difference. However, since the gene expression profiles are obscured by the varying levels manifested in the data, the time difference must be obtained on expression levels with the same scale and dynamic range. Motivated by these observations, the proposed algorithm has three steps. In the first step, the expression data is rescaled. In the second step, the signal shape information is captured by calculating the first-order time difference. In the last step, clustering is performed on the time-difference data using a Variational Bayes Expectation Maximization (VBEM) algorithm. In the following, each step is discussed in detail.
2.2. Initial Data Transformation
Each gene sequence was rescaled by subtracting the mean value of the sequence from each of its elements, resulting in sequences with zero mean. This operation was intended to mitigate the widely different magnitudes and slopes in the profile data. By resetting all genes to a zero-mean sequence, the overall shape of each sequence could be better identified without the complication of comparing genes with different magnitudes.
After this, the resulting sequences were normalized such that the maximum absolute value of each sequence was 1. Expression changes in related genes can be large or small; if two genes are related, that relationship should be recoverable regardless of the amplitude of change. By renormalizing the data in this manner, the amplitudes of both large-change and small-change genes were placed into the same order of magnitude.
Here the mean is that of each individual gene sequence, taken over its time points.
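The two-stage transformation described in this section (mean removal followed by scaling to unit maximum absolute value) can be sketched as follows. This is only an illustration under stated assumptions: the function name `rescale` and the handling of a completely flat profile are hypothetical and are not taken from the original implementation.

```python
def rescale(profile):
    """Zero-mean a profile, then divide by its maximum absolute value."""
    mean = sum(profile) / len(profile)
    centered = [x - mean for x in profile]
    peak = max(abs(x) for x in centered)
    if peak == 0.0:
        # Assumed handling: a flat profile carries no shape and is left at zero.
        return centered
    return [x / peak for x in centered]


# Two hypothetical profiles with the same shape but very different magnitudes.
low = rescale([1.0, 2.0, 3.0])
high = rescale([100.0, 200.0, 300.0])
print(low, high)
```

Both hypothetical profiles map to the same rescaled sequence, which is the property the transformation is meant to provide.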
2.3. Extraction of Shape Information and Time Scaling
To extract shape information from time-varying gene expression, the derivative of the expression trajectory is considered. Since we are dealing with discrete sequences, differences must be used rather than analytical derivatives. To characterize the shape of each sequence, a simple first-difference scheme was used: the magnitude difference between the succeeding point and the point under consideration, divided by the time difference between those points. The data was taken nonuniformly over a period of approximately 100 minutes, with sample times varying from 7 to 50 minutes. As the transformation in (1) already scales the data to a range of [-1, 1], further compressing that scale by nearly 2 orders of magnitude over some time stretches was deemed neither prudent nor necessary. Therefore, the time difference was scaled in hours to prevent this unneeded range compression. The resulting sequences were used as data for clustering.
Each gene thus has an associated vector of time points, a vector of transformed time-series data (from (1)), and a resulting vector of first differences, which has one element fewer than the data vector.
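A minimal sketch of the time-scaled forward first difference described above is given here. The function name, the choice to pass sample times in minutes, and the sample values are assumptions made for illustration; only the scheme itself (difference of successive values divided by the time gap expressed in hours) comes from the text.

```python
def forward_differences(values, times_min):
    """Forward first differences, with each time gap converted to hours."""
    if len(values) != len(times_min):
        raise ValueError("values and times_min must have the same length")
    diffs = []
    for i in range(len(values) - 1):
        dt_hours = (times_min[i + 1] - times_min[i]) / 60.0
        diffs.append((values[i + 1] - values[i]) / dt_hours)
    return diffs


# Hypothetical, nonuniformly sampled profile that is already rescaled to [-1, 1].
times = [0.0, 30.0, 60.0, 120.0]
values = [-1.0, 0.0, 0.5, 1.0]
shape = forward_differences(values, times)
print(shape)
```

Dividing by hours rather than minutes keeps the differences on roughly the same order of magnitude as the rescaled data, which is the motivation stated above.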
2.4. Clustering
Once the sequence of first differences was calculated for each gene, clustering was performed on these first-order differences. To this end, a VBEM algorithm was developed. Before presenting that development, a general discussion of VBEM is in order.
where and are, respectively, the latent variables and the model parameters. The integration is taken over both variables and parameters in order to prevent overfitting, as a model with many parameters would naturally be able to fit a wider variety of datasets than a model with few parameters.
Maximizing this functional is equivalent to minimizing the KL distance between and . The distributions and are coupled and must be iterated until they converge.
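For orientation, the general form of the variational lower bound in the VBEM framework of [21] is restated here. The symbols are notational assumptions made for this sketch and may differ from those of the original equations: $y$ for the data, $z$ for the latent variables, $\theta$ for the model parameters, $m$ for the model, and $q_z$, $q_\theta$ for the free distributions.

```latex
\ln p(y \mid m)
  = \ln \iint p(y, z, \theta \mid m)\, dz\, d\theta
  \;\ge\; \iint q_z(z)\, q_\theta(\theta)\,
      \ln \frac{p(y, z, \theta \mid m)}{q_z(z)\, q_\theta(\theta)}\, dz\, d\theta
  \;\equiv\; \mathcal{F}(q_z, q_\theta),
\qquad
\ln p(y \mid m) - \mathcal{F}(q_z, q_\theta)
  = \mathrm{KL}\!\left( q_z(z)\, q_\theta(\theta) \,\middle\|\, p(z, \theta \mid y, m) \right) \ge 0 .
```

The second identity shows why raising the bound and lowering the KL distance are the same operation: the left-hand side $\ln p(y \mid m)$ does not depend on the free distributions. For discrete latent variables such as cluster labels, the integral over $z$ becomes a sum.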
where are the known parameters of the distribution. Given the transformed expressions of genes, , the stated two tasks are equivalent to estimating , the total number of clusters, and for all genes.
where is the marginal likelihood given the model has clusters, and is the a posteriori probability of when the total number of clusters is .
Unfortunately, there are now multiple unknown nuisance parameters at this point: , , , , , and all still need to be found. To do so requires a marginalization procedure over all the unknowns, which is intractable for unknown cluster id . Therefore, a VBEM scheme is adopted for estimating the necessary distributions.
2.5. VBEM Algorithm
where as above the inequality derives by use of Jensen's inequality. The free distributions and are introduced as approximations to the unknown distributions and . The distributions are chosen so as to maximize the lower bound. Using variational derivatives and an iterative coordinate ascent procedure, we find
VBE step:
VBM step:
where and are iterations and are normalizing constants to be determined. Because of the integration in (13), must be chosen carefully in order to have an analytic expression. By choosing as a member of the exponential family, this condition is satisfied. Note is an approximation to the posterior distribution and therefore can be used to obtain the estimate of .
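In the generic VBEM framework of [21], the two coordinate-ascent updates take the following form. The notation is assumed for this sketch ($y$ data, $z$ latent variables, $\theta$ parameters, $t$ the iteration index, $C_z$ and $C_\theta$ normalizing constants) and is not guaranteed to match the symbols of the original equations.

```latex
\text{VBE step:}\quad
q_z^{(t+1)}(z) = \frac{1}{C_z}
  \exp\!\left[ \int q_\theta^{(t)}(\theta)\, \ln p(y, z \mid \theta)\, d\theta \right],
\qquad
\text{VBM step:}\quad
q_\theta^{(t+1)}(\theta) = \frac{1}{C_\theta}\, p(\theta)\,
  \exp\!\left[ \int q_z^{(t+1)}(z)\, \ln p(y, z \mid \theta)\, dz \right].
```

The expectation inside the VBE step is what makes the choice of prior matter: when the complete-data likelihood and the prior are conjugate members of the exponential family, both exponents reduce to closed-form expressions in a small set of expected sufficient statistics.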
2.6. Summary of VBEM Algorithm
The VBEM algorithm is summarized as follows:
 (1) Initialization:
  (i) initialize the model parameters, including a, b, k, and L.
 Iterate until the lower bound converges:
 (2) VBE step:
  (i) for each gene,
  (ii) calculate the VBE update using (A.1) in Appendix A,
  (iii) end.
 (3) VBM step:
  (i) for each cluster,
  (ii) calculate the VBM update using (B.1) in Appendix B,
  (iii) end.
 (4) Lower bound:
  (i) calculate the lower bound using (C.1) in Appendix C.
 End iteration.
2.7. Choice of the Optimum Number of Clusters
The Bayesian formulation of (11) suggests using the number of clusters that maximizes the marginal likelihood or, in the context of VBEM, the lower bound. Instead of basing the determination of the number of clusters solely on the lower bound, four different criteria are investigated in this work: (a) the lower bound used within the VBEM algorithm (labelled KL), (b) the Bayes Information Criterion (BIC) [23], (c) the Silhouette statistic performed on clusters built from transformed data, and (d) the Silhouette statistic performed on clusters built from raw data. The VBEM lower bound is discussed above; the BIC and Silhouette criteria are discussed below.
2.8. Bayes Information Criterion (BIC)
where is the likelihood function of data given parameters , is the size (dimensionality) of parameter set , and is the sample size. The term is a penalty term discouraging more complex models.
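A common way of writing the criterion of [23] that matches the description above is given here as a sketch. The symbols are assumptions for illustration: $y$ for the data, $\hat{\theta}$ for the fitted parameters, $d$ for the dimensionality of the parameter set, and $N$ for the sample size.

```latex
\mathrm{BIC} = \ln p(y \mid \hat{\theta}) - \frac{d}{2} \ln N .
```

In this sign convention larger values are better and the second term is the complexity penalty. Some references instead report $-2 \ln p(y \mid \hat{\theta}) + d \ln N$, which is minimized; the two forms rank models identically.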
2.9. Silhouette Statistic
It is quickly seen that the range of this statistic is [-1, 1]. A value close to 1 means the data vector is very probably assigned to the correct cluster, while a value close to -1 means the data vector is very probably assigned to the wrong cluster. A value near 0 is a neutral evaluation.
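The behaviour of the statistic can be illustrated with a small standard-library sketch of the Silhouette value of [22], s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its own cluster and b(i) is the smallest mean distance to any other cluster. The Euclidean distance, the zero score for singleton clusters, and the example points are assumptions of this sketch; the study itself used the MATLAB built-in function.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value of every point (needs at least two clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_values(points, [0, 1, 1, 1])   # second point misassigned
print(good, bad)
```

With the sensible assignment every value is close to 1, while the misassigned point in the second labelling receives a value close to -1.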
3. Results
We illustrate the method using simulated expression data and with microarray data available online.
3.1. Simulation Study
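Basis vectors for clusters in sample datasets.
Cluster  Subcluster  Mean vector  Mean shift  Scale factor
a  a  (not preserved)  0  1
b  b  (not preserved)  0  1
b  bm  as above  (not preserved)  1
c  c  (not preserved)  0  1
c  cs  as above  0  0.25
d  d  (not preserved)  0  1
d  dms  as above  (not preserved)  0.25
e  e  (not preserved)  0  1
e  em  as above  (not preserved)  1
e  es  as above  0  0.25
e  ems  as above  (not preserved)  0.25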
The datasets constructed from these basis vectors differed in the number of data vectors per subcluster (and thus the total number of data vectors) and in the standard deviation used to vary the individual vector values about their corresponding basis vectors. Generally speaking, the standard deviation vectors were constructed to be approximately 25% of the mean vector for the "low-noise" sets, and approximately 50% of the mean vector for the "high-noise" sets.
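One way such a test set could be generated is sketched below. Several details are assumptions rather than facts from the study: the noise is drawn from a Gaussian distribution, the scale factor is applied before the mean shift, and the basis vector shown is hypothetical (the actual vectors are those of the table above).

```python
import random


def make_subcluster(mean_vector, std_vector, shift, scale, replicates, rng):
    """Draw noisy replicates of a basis vector, then scale and shift them."""
    rows = []
    for _ in range(replicates):
        # Assumed noise model: independent Gaussian noise per time point.
        noisy = [rng.gauss(m, s) for m, s in zip(mean_vector, std_vector)]
        # Assumed order: scale first, then add the mean shift.
        rows.append([scale * v + shift for v in noisy])
    return rows


rng = random.Random(0)
base = [1.0, 2.0, 4.0]                        # hypothetical basis vector
low_noise_std = [0.25 * abs(m) for m in base]  # about 25% of the mean vector
sub = make_subcluster(base, low_noise_std, shift=0.0, scale=1.0,
                      replicates=5, rng=rng)
print(len(sub), len(sub[0]))
```

A shifted or scaled subcluster has the same shape as its parent but a different level or amplitude, so a shape-based method is expected to place both in the same cluster.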
3.2. "Low-Noise" Test Datasets
Standard deviation vectors for clusters in "low-noise" sample datasets.
Cluster  Standard deviation vector
a  (not preserved)
b  (not preserved)
c  (not preserved)
d  (not preserved)
e  (not preserved)
3.3. "High-Noise" Test Datasets
Standard deviation vectors for clusters in "high-noise" sample datasets.
Cluster  Standard deviation vector
a  (not preserved)
b  (not preserved)
c  (not preserved)
d  (not preserved)
e  (not preserved)
Subcluster replicates and total vector sizes for "high-noise" datasets.
Test set  Replicates per subcluster  Total N
3  5  55 
4  9  99 
5  30  330 
6  50  550 
7  70  770 
8  99  1089 
3.4. Test Types and Evaluation Measures
To evaluate the ability of VBEM to properly cluster the datasets, two test sequences were conducted. First, the data was clustered using VBEM in a "controlled" fashion; that is, the number of clusters was assumed to be known and passed to the algorithm. Second, the algorithm was tested in an "uncontrolled" fashion; that is, the number of clusters was unknown, and the algorithm had to predict the number of clusters given the data. During the uncontrolled tests, a Kmeans algorithm was also run against the data as a comparison.
The VBEM algorithm as currently implemented requires an initial (random) probability matrix for the distribution of genes to clusters, given a value for . Therefore, for each dataset, 55 trials were conducted, each trial having a different initial matrix.
Also, each trial begins with an initial clustering of genes. As currently implemented, this initialization is performed using a K-means algorithm. The algorithm attempts to cluster the data such that the sum of squared differences between data within a cluster is minimized. Depending on the initial starting position, this clustering may change. In MATLAB, the built-in K-means algorithm has several options, including how many different trials (from different starting points) are conducted to produce a "minimum" sum-squared distance, how many iterations are allowed per trial to reach a stable clustering, and how clusters that become "empty" during the clustering process are handled. For these tests, the K-means algorithm conducted 100 trials of its own per initial probability matrix (and output the clustering with the smallest sum-squared distance), had a limit of 100 iterations, and created a "singleton" cluster when a cluster became empty.
As mentioned above, the choice of optimum K was made using four different calculations. The first used the estimate of the VBEM lower bound; the second used the BIC equation. In both cases, the optimum K for a particular trial was that at which the value showed a decrease when K was increased. This does not mean the values used to determine the optimum K were the absolute maxima for the parameter within that trial; in fact, they usually were not. The overall optimum K for a particular choice of parameter was the maximum value over the number of trials. The third and fourth criteria made use of the Silhouette statistic, one using the clusters of transformed data and one using the corresponding clusters of raw data. We used the built-in Silhouette function contained within MATLAB for our calculations. To find the optimum K, the mean Silhouette value for all data vectors in a clustering was calculated for each value of K. The value of K for which the mean value was maximized was chosen as the optimum K.
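The two selection rules described above can be sketched as follows. The function names and the numbers are hypothetical, and the first rule reflects one reading of the text: the chosen K is the last value before the criterion first decreases as K grows.

```python
def first_decrease(scores_by_k):
    """Return the K just before the criterion first drops as K increases."""
    ks = sorted(scores_by_k)
    for k, k_next in zip(ks, ks[1:]):
        if scores_by_k[k_next] < scores_by_k[k]:
            return k
    return ks[-1]  # the criterion never decreased over the tested range


def best_mean_silhouette(silhouettes_by_k):
    """Return the K whose clustering has the largest mean Silhouette value."""
    def mean(values):
        return sum(values) / len(values)
    return max(silhouettes_by_k, key=lambda k: mean(silhouettes_by_k[k]))


# Hypothetical criterion values for a single trial.
lower_bound = {3: -120.0, 4: -95.0, 5: -101.0, 6: -90.0}
silhouettes = {3: [0.4, 0.5], 4: [0.7, 0.6], 5: [0.3, 0.9]}
k_bound = first_decrease(lower_bound)
k_sil = best_mean_silhouette(silhouettes)
print(k_bound, k_sil)
```

In the hypothetical lower-bound values the first rule selects K = 4 even though the largest value occurs at K = 6, which illustrates the remark that the selected value is usually not the absolute maximum within a trial.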
where is the probability that computed cluster belongs to a priori cluster given that is in fact the correct cluster, and is the probability of a priori cluster occurring. refers to the misclassification rate using statistic (KL, BIC, both Silhouette) for trial . This rate is in the range and is equal to 1 only when the number of clusters is properly predicted and those calculated clusters match the a priori clusters. Thus, both under and overprediction of clusters were penalized.
For the "uncontrolled" tests, the above 4 algorithms were tested with the number of clusters unknown. Further, K-means clustering with the Silhouette statistic (KM/SilT and KM/SilR) was also conducted for comparison. The results for the 6 "high-noise" datasets are summarized below.
V/KL and V/BIC both performed poorly with all datasets, in most cases overpredicting the number of clusters. As can be seen in Figure 4, this overprediction tended to increase with dataset size N. V/BIC resulted in a lower overprediction than V/KL.
3.5. Test Results Conclusion
The VBEM algorithm can correctly cluster shape-based data even in the presence of fairly high amounts of noise, when paired with the Silhouette statistic performed on the raw data clusters (V/SilR). Further, V/SilR is robust in correctly predicting the number of clusters in noise. The misclassification rate is superior to K-means using Silhouette statistics, as well as VBEM using all other statistics. Because of this, it was expected that V/SilR would be the algorithm of choice for the experimental microarray data. However, to maintain comparison, all four VBEM/statistic algorithms were tested.
3.6. Experimental E. Coli Expression Data
The proposed approach for gene clustering on shape similarity was tested using time-series data from the University of Oklahoma E. coli Gene Expression Database resident at their Bioinformatics Core Facility (OUBCF) [24]. The exploration concentrated on the wild-type MG1655 strain during exponential growth on glucose. The data available consisted of 5 time-series log-ratio samples of 4389 genes.
The initial tests were run against genes identified as being from metabolic categories. Specifically, genes identified in the E. coli K12 Entrez Genome database at the National Center for Biotechnology Information, US National Library of Medicine, National Institutes of Health (http://www.ncbi.nlm.nih.gov/) [25] (NIH) as being in categories C, G, E, F, H, I, and/or Q were chosen.
Because of the short sequence lengths, any gene with even a single invalid data point was removed from the set. With only 5 time samples to work with in each gene sequence, even a single missing point would have significant ramifications in the final output. The final set of genes used for testing numbered 1309.
In implementing the VBEM algorithm, initial values for the algorithm were . The algorithm was set to iterate until the change in lower bound decreased below or became negative (which required the prior iteration to be taken as the end value) or 200 iterations, whichever came first. The optimal number of clusters was arrived at by multiple runs of the algorithm at values of K, the predefined number of clusters, varying from 3 to 15. was chosen in the same manner as in the test data sequences.
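The stopping rule described above can be sketched as a generic loop. Everything named here is hypothetical: `step` stands for one combined VBE/VBM update that returns a new state and its lower bound, and `tol` stands for the convergence threshold, whose actual value is not reproduced in this sketch.

```python
def iterate_until_converged(step, state, tol, max_iter=200):
    """Run step(state) -> (new_state, lower_bound) until the bound settles."""
    best_state, best_bound = state, float("-inf")
    for iteration in range(1, max_iter + 1):
        new_state, bound = step(best_state)
        change = bound - best_bound
        if change < 0:
            # The bound went down: keep the prior iteration as the end value.
            return best_state, best_bound, iteration - 1
        best_state, best_bound = new_state, bound
        if change < tol:
            break
    return best_state, best_bound, iteration


# Toy update standing in for a VBE/VBM sweep: move halfway towards 1.0.
def toy_step(x):
    new_x = x + (1.0 - x) / 2.0
    return new_x, -(1.0 - new_x) ** 2


final_state, final_bound, n_iter = iterate_until_converged(toy_step, 0.0, tol=1e-3)
print(final_state, final_bound, n_iter)
```

An outer loop would then call such a routine once per candidate number of clusters, from 3 to 15 in the study, and pass the resulting clusterings to the selection criteria.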
3.7. Validation of E. Coli Expression Data Results
We validated the results of our tests using Gene Ontology (GO) enrichment analysis. To this end, the genes used in the analysis were tagged with their respective GO categories and analyzed within each cluster for overrepresentation of certain categories versus the "background" level of the population (in this case, the entire set of metabolic genes used). Again, the Entrez Genome database at NIH was used for the GO annotation information. As most of the entries enriched were from the Biological Process portion of the ontology, the analysis was restricted to those terms.
To perform the analysis, the software package Cytoscape (http://www.cytoscape.org/) [26] was used. Cytoscape offers access to a wide variety of plugin analysis packages, including a GO enrichment analysis tool, BiNGO, which stands for Biological Network Gene Ontology (http://www.psb.ugent.be/cbd/papers/BiNGO/) [27].
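Overrepresentation of a GO term in a cluster is commonly assessed with a hypergeometric tail probability, which is one of the tests BiNGO offers. The text does not state which test or multiple-testing correction was configured, so the following is only a sketch of that general idea; the function name and all counts other than the population size of 1309 metabolic genes are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from a population in which `annotated` genes carry the GO term."""
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 40 of the 1309 background genes carry the term,
# and 12 of them fall in a cluster of 100 genes.
p = enrichment_p_value(population=1309, annotated=40, cluster_size=100, hits=12)
print(p)
```

A small value indicates that the term occurs in the cluster more often than expected from the background set; in practice the values for all tested terms are corrected for multiple testing before enrichment is declared.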
To evaluate the clusters, we modified an approach used by Yuan and Li [28] to score the clusters based on the information content and the likelihood of enrichment ( ). Unlike [28], however, a distance metric was not included in the calculations. Because of the large cluster sizes involved, such distance calculations would have exacted a high calculation overhead. Rather, the simpler approach of forming subclusters of adjacent enriched terms was chosen; that is, if two GO terms had a relationship to each other and were both enriched, they were placed in the same subcluster and their scores multiplied by the number of terms in the subcluster. Also, a large portion of the score of any term shared across more than one cluster was subtracted. This method rewarded large subclusters, while penalizing numerous small subclusters and overlapping terms.
where is the probability of GO term being selected, is the negative of the information content of the GO term, and is the value of the GO term . Large subclusters are rewarded by larger values of . Subtracting 1 from compensates for the "baseline" score value; that is, the score a cluster would achieve if no terms were connected. The final term in the equation is the devaluation of any GO term shared by clusters.
Given that the algorithm was expected to group related functions together, the expectation for GO analysis was the creation of large, highly connected subclusters within each main gene cluster. Ideally, one such subcluster would subsume the entire cluster; however, a small number of large subclusters within each cluster would validate the algorithm. The scoring equation (18) greatly rewards large, highly connected subclusters; in fact, given a cluster, the score is maximized by having all GO terms within that cluster be connected within a single subcluster.
Summary scores from E. coli data analysis.
Algorithm  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Total score  Average score
V/SilR  153.14  2004.55  22129.80  -  -  24287.48  8095.83
V/SilT  405.73  3.10  82.95  7343.89  -  7835.67  1958.92
V/BIC  4.42  422.42  513.70  44.64  11196.16  12181.33  2436.27
4. Conclusion
Four combinations of the VBEM algorithm and cluster statistics were tested. One of these, VBEM combined with the Silhouette statistic performed on the raw data clusters, clearly outperformed the other three in both simulated and real data tests. This method shows promise in clustering time-series microarray data according to profile shape.
Appendices
A. Calculation of VBE Step
where : number of time samples; : number of genes (index ); , and all other parameters are calculated from the VBM step.
B. Calculation of VBM Step
where ; ; ; ; ; ; : NormalInverseGamma distribution; : Dirichlet distribution.
C. Calculation of Lower Bound
where and .
Declarations
Acknowledgment
This work is supported in part by NSF Grant CCF0546345. Dr. Tim Lilburn has been instrumental with his assistance and guidance.
References
 [1] Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 2004, 16(11):1370-1386. doi:10.1109/TKDE.2004.68
 [2] Asyali MH, Colak D, Demirkaya O, Inan MS: Gene expression profile classification: a review. Current Bioinformatics 2006, 1(1):55-73. doi:10.2174/157489306775330615
 [3] Bar-Joseph Z, Gerber GK, Gifford DK, Jaakkola TS, Simon I: Continuous representations of time-series gene expression data. Journal of Computational Biology 2003, 10(3-4):341-356. doi:10.1089/10665270360688057
 [4] Ma P, Castillo-Davis CI, Zhong W, Liu JS: A data-driven clustering method for time course gene expression data. Nucleic Acids Research 2006, 34(4):1261-1269. doi:10.1093/nar/gkl013
 [5] Rueda L, Bari A, Ngom A: Clustering time-series gene expression data with unequal time intervals. In Transactions on Computational Systems Biology X. Lecture Notes in Computer Science, Volume 5410. Springer, Berlin, Germany; 2008:100-123. doi:10.1007/978-3-540-92273-5_6
 [6] Yuan Y, Li CT: Unsupervised clustering of gene expression time series with conditional random fields. Proceedings of the Inaugural IEEE International Conference on Digital EcoSystems and Technologies (DEST '07), Cairns, Australia, February 2007, pp. 571-576.
 [7] Kim K, Zhang S, Jiang K, et al.: Measuring similarities between gene expression profiles through new data transformations. BMC Bioinformatics 2007, 8, article 29: 1-14.
 [8] Wen X, Fuhrman S, Michaels GS, et al.: Large-scale temporal gene expression mapping of central nervous system development. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(1):334-339. doi:10.1073/pnas.95.1.334
 [9] Möller-Levet CS, Klawonn F, Cho KH, Yin H, Wolkenhauer O: Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets and Systems 2005, 152(1):49-66. doi:10.1016/j.fss.2004.10.014
 [10] Balasubramaniyan R, Hüllermeier E, Weskamp N, Kämper J: Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics 2005, 21(7):1069-1077. doi:10.1093/bioinformatics/bti095
 [11] Phang TL, Neville MC, Rudolph M, Hunter L: Trajectory clustering: a nonparametric method for grouping gene expression time courses, with applications to mammary development. Proceedings of the 8th Pacific Symposium on Biocomputing (PSB '03), Lihue, Hawaii, USA, January 2003, pp. 351-362.
 [12] Tjaden B: An approach for clustering gene expression data with error information. BMC Bioinformatics 2006, 7, article 17: 1-15.
 [13] Ben-Hur A, Elisseeff A, Guyon I: A stability based method for discovering structure in clustered data. Proceedings of the 7th Pacific Symposium on Biocomputing (PSB '02), Lihue, Hawaii, USA, January 2002, pp. 6-17.
 [14] Dimitriadou E, Dolničar S, Weingessel A: An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 2002, 67(1):137-159. doi:10.1007/BF02294713
 [15] Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 2002, 3(7):1-21.
 [16] Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B 2001, 63(2):411-423. doi:10.1111/1467-9868.00293
 [17] Sun H, Sun M: Trail-and-error approach for determining the number of clusters. Proceedings of the 4th International Conference on Machine Learning and Cybernetics (ICMLC '05), Guangzhou, China. Lecture Notes in Computer Science, Volume 3930, August 2006, pp. 229-238.
 [18] Wild DL, Rasmussen CE, Ghahramani Z: A Bayesian approach to modeling uncertainty in gene expression clusters. Proceedings of the 3rd International Conference on Systems Biology (ICSB '02), Stockholm, Sweden, December 2002.
 [19] Xu Y, Olman V, Xu D: Minimum spanning trees for gene expression data clustering. Genome Informatics 2001, 12: 24-33.
 [20] Yan M, Ye K: Determining the number of clusters using the weighted gap statistic. Biometrics 2007, 63(4):1031-1037. doi:10.1111/j.1541-0420.2007.00784.x
 [21] Beal MJ, Ghahramani Z: The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Proceedings of the 7th Valencia International Meeting on Bayesian Statistics, Tenerife, Spain, June 2003, 7: 453-464.
 [22] Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987, 20: 53-65. doi:10.1016/0377-0427(87)90125-7
 [23] Schwarz G: Estimating the dimension of a model. Annals of Statistics 1978, 6(2):461-464. doi:10.1214/aos/1176344136
 [24] The University of Oklahoma's E. coli Gene Expression Database, http://chase.ou.edu/oubcf/
 [25] The Entrez Genome Database. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health: Escherichia coli K12 data. http://www.ncbi.nlm.nih.gov/
 [26] Shannon P, Markiel A, Ozier O, et al.: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 2003, 13(11):2498-2504. doi:10.1101/gr.1239303
 [27] Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21(16):3448-3449. doi:10.1093/bioinformatics/bti551
 [28] Yuan Y, Li CT: Probabilistic framework for gene expression clustering validation based on gene ontology and graph theory. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), Las Vegas, Nev, USA, March-April 2008, pp. 625-628.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.