Hierarchical Dirichlet process model for gene expression clustering
 Liming Wang^{1} and Xiaodong Wang^{2}
DOI: 10.1186/1687-4153-2013-5
© Wang and Wang; licensee Springer. 2013
Received: 17 October 2012
Accepted: 11 March 2013
Published: 12 April 2013
Abstract
Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet process (HDP). The HDP clustering introduces a hierarchical structure into the statistical model which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for the HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary clustering fragments.
1 Introduction
The microarray technology has made it possible to monitor the expression levels of thousands of genes in parallel under various conditions [1]. Due to the high-volume nature of the microarray data, one often needs certain algorithms to investigate the gene functions, regulation relations, etc. Clustering is considered to be an important tool for analyzing biological data [2–4]. The aim of clustering is to group the data into disjoint subsets, where in each subset the data show certain similarities to each other. In particular, for microarray data, genes in each clustered group exhibit correlated expression patterns under various experiments.
Several clustering methods have been proposed, most of which are distance-based algorithms. That is, a distance is first defined for the clustering purpose and the clusters are then formed based on the distances of the data. Typical algorithms in this category include the K-means algorithm [5] and the self-organizing map (SOM) algorithm [6]. These algorithms are based on simple rules, and they often suffer from robustness issues, i.e., they are sensitive to noise, which is extensive in biological data [7]. For example, the SOM algorithm requires the user to specify the number of clusters in advance; an incorrect choice of this parameter may yield a wrong result.
Another important category of clustering methods is the model-based algorithms. These algorithms employ a statistical approach to model the structure of clusters. Specifically, data are assumed to be generated by some mixture distribution. Each component of the mixture corresponds to a cluster. Usually, the parameters of the mixture distribution are estimated by the EM algorithm [8]. The finite-mixture model [9–11] assumes that the number of mixture components is finite and the number can be estimated using the Bayesian information criterion [12] or the Akaike information criterion [13]. However, since the estimation of the number of clusters and the estimation of the mixture parameters are performed separately, the finite-mixture model may be sensitive to the different choices of the number of clusters [14].
The infinite-mixture model has been proposed to cope with the above sensitivity problem of the finite-mixture model. This model does not assume a specific number of components and is primarily based on the Dirichlet processes [15, 16]. The clustering process can equivalently be viewed as a Chinese restaurant process [17], where the data are considered as customers entering a restaurant. Each component corresponds to a table with infinite capacity. A new customer joins a table according to the current assignment of seats.
Hierarchical clustering (HC) is yet another, more advanced approach, especially for biological data [18], which groups together the data with similar features based on the underlying hierarchical structure. Biological data often exhibit hierarchical structure, e.g., one cluster may overlap heavily with another cluster or be embedded in it [19]. If such hierarchical structure is ignored, the clustering result may contain many fragmental clusters which could have been combined together. Hence, for biological data, HC has advantages over many traditional clustering algorithms. The performance of such HC algorithms depends highly on the quality of the data and the specific agglomerative or divisive strategies the algorithms use for combining clusters.
In this article, we propose a model-based clustering algorithm for gene expression data based on the hierarchical Dirichlet process (HDP) [21]. The HDP model incorporates the merits of both the infinite-mixture model and HC. The hierarchical structure is introduced to allow sharing data among related clusters. On the other hand, the model uses the Dirichlet processes as the nonparametric Bayesian prior, which does not assume a fixed number of clusters a priori.
The remainder of the article is organized as follows. In Section 2, we introduce some necessary mathematical background and formulate the HC problem as a statistical inference problem. In Section 3, we derive a Gibbs sampler-based inference algorithm based on the Chinese restaurant metaphor of the HDP model. In Section 4, we provide experimental results of the proposed HDP algorithm for two applications, regulatory network segmentation and gene expression clustering. Finally, Section 5 concludes the article.
2 System model and problem formulation
As in any model-based clustering method, it is assumed that the gene expression data are random samples from some underlying distributions. All data in one cluster are generated by the same distribution. For most existing clustering algorithms, each gene is associated with a vector containing the expressions in all experiments. The clustering of the genes is based on their vectors. However, such an approach ignores the fact that genes may show different functionalities under various experiment conditions, i.e., different clusters may be formed under different experiments. In order to cope with this phenomenon, we treat each expression separately. More specifically, we allow different expressions of the same individual gene to be generated by different statistical models.
where k is determined by z_{ j i }.
Note that in this article, the boldface letter always refers to a set formed by the elements with specified indices.
The above model is a relatively general one which subsumes many previous models as special cases. For example, as in all Bayesian approaches, all variables are assigned proper priors. It is very popular to use the mixture model as the prior, which models the data as generated by a mixture of distributions, e.g., a linear combination of a family of distributions such as Gaussian distributions. Each cluster is generated by one component in the mixture distribution given the membership variable [14]. This approach corresponds to our model if we assume that Π is finitely supported and F is Gaussian.
where g={g_{ j i }}_{j,i}.
We note that in case one is interested in finding other related clusters for one gene, one can simply use the inferred distribution of the membership variable to obtain this information.
2.1 Dirichlet processes and infinite mixture model
Instead of assuming a fixed number of clusters a priori, one can assume an infinite number of clusters to avoid the aforementioned sensitivity to the estimated number of clusters. Correspondingly, in (4), the prior Π is an infinite discrete distribution. Again, in the Bayesian fashion, we introduce priors for all parameters. The Dirichlet process is one such prior. It can be viewed as a random measure [15], i.e., the domain of this process (viewed as a measure) is a collection of probability measures. In this section, we give a brief introduction to the Dirichlet process, which serves as the key prior in our HDP model.
where ${\sum}_{i=1}^{K}{x}_{i}=1,{u}_{i}>0,i=1,\dots ,K,$ and Γ(·) is the Gamma function. Since every point in the domain is a discrete probability measure, the Dirichlet distribution is a random measure in the finite discrete probability space.
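As a quick illustration of the Dirichlet distribution (a Python sketch added for exposition, not part of the original derivation), a sample can be drawn by normalizing independent Gamma variates; each draw is itself a probability vector on K points:

```python
import random

def sample_dirichlet(u, rng=random):
    """Draw one sample from Dirichlet(u) by normalizing independent
    Gamma(u_i, 1) variates (a standard construction)."""
    g = [rng.gammavariate(ui, 1.0) for ui in u]
    s = sum(g)
    return [gi / s for gi in g]

# Each draw is a discrete probability vector on K = 3 points,
# illustrating that Dirichlet(u) is a random measure on the simplex.
x = sample_dirichlet([2.0, 3.0, 5.0])
print(x, sum(x))
```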
The Dirichlet process generalizes the Dirichlet distribution to continuous spaces. There are various constructive and non-constructive definitions of Dirichlet processes. For simplicity, we use the following non-constructive definition.
where $\mathcal{G}$ is drawn from D(α_{0},μ_{0}).
The Dirichlet processes can be characterized in various ways [15], such as the stick-breaking construction [22] and the Chinese restaurant process [23]. The Chinese restaurant process serves as a visualized characterization of the Dirichlet process.
Let x_{1},x_{2},… be a sequence of random variables drawn from the Dirichlet process D(α_{0},μ_{0}). Although we do not have an explicit formula for D, we would like to know the conditional probability of x_{ i } given x_{1},…,x_{i−1}. In the Chinese restaurant model, the data can be viewed as customers sequentially entering a restaurant with an infinite number of tables. Each table corresponds to a cluster with unlimited capacity. Each customer x_{ i } entering the restaurant joins an already occupied table with probability proportional to the number of customers seated at it. In addition, the new customer may sit at a new table with probability proportional to α_{0}. Tables that have already been occupied by customers thus tend to gain more and more customers.
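The seating dynamics described above can be simulated directly. The following sketch (an illustration added here, not from the original article) assigns customers to tables with the stated probabilities:

```python
import random

def crp_assignments(n, alpha0, rng=random):
    """Simulate table assignments for n customers in a Chinese
    restaurant process with concentration parameter alpha0."""
    counts = []   # counts[t] = number of customers at table t
    labels = []   # labels[i] = table index of customer i
    for _ in range(n):
        total = sum(counts) + alpha0
        r = rng.uniform(0.0, total)
        acc = 0.0
        # existing tables are chosen proportionally to their occupancy
        for t, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[t] += 1
                labels.append(t)
                break
        else:
            counts.append(1)  # new table, with prob. alpha0 / total
            labels.append(len(counts) - 1)
    return labels, counts

labels, counts = crp_assignments(100, alpha0=1.0)
print(len(counts), "tables for 100 customers")
```

With small alpha0 most customers pile onto a few tables, illustrating the rich-get-richer behavior noted above.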
One remarkable property of the Dirichlet process is that although it is defined on a continuous space, its draws are almost surely discrete (supported on countably many atoms) [15]. In other words, almost every sample distribution drawn from the Dirichlet process is a discrete distribution. As a consequence, the Dirichlet process is suitable to serve as a nonparametric prior of the infinite mixture model.
Recall that a draw from D(α_{0},μ_{0}) is almost surely discrete; its atoms correspond to the indices of the clusters.
2.2 HDP model
Biological data such as the expression data often exhibit hierarchical structures. For example, although clusters can be formed based on similarities, some clusters may still share certain similarities among themselves at different levels. Within one cluster, the genes may share similar features. But at the level of clusters, one cluster may share some similar feature with some other clusters. Many traditional clustering algorithms fail to recognize such hierarchical information and are not able to group these similar clusters into a new cluster, producing many fragments in the final clustering result. As a consequence, it is difficult to interpret the functionalities and meanings of these fragments. Therefore, it is desirable to have an algorithm that is able to cluster among clusters. In other words, the algorithm should be able to cluster based on multiple features at different levels. In order to capture the hierarchical structure of the gene expressions, we now introduce the hierarchical model to allow clustering at different levels. The clustering algorithm based on the hierarchical model not only reduces the number of cluster fragments, but also may reveal more details about the unknown functionalities of certain genes, since clusters may share multiple features.
where D_{1}(α_{1},μ_{1}) is another Dirichlet process. In this article, we use the same letter for the measure, the distribution it induces, and the corresponding density function as long as it is clear from the context. Moreover, we could extend the hierarchies to as many levels as we wish at the expense of complexity of the inference algorithm. The desired number of hierarchies can be determined by the prior biological knowledge. In this article, we focus on a two-level hierarchy.
As a remark, we would like to point out the connection and difference between the “hierarchy” in the proposed HDP method and that in traditional HC [4]. Both the HDP and HC algorithms can provide HC results. The hierarchy in the HDP method is manifested by the Chinese restaurant process, which will be introduced later, where data sitting at the same table form the first level and all tables sharing the same dish form the second level. In contrast, the hierarchy in HC is obtained by merging existing clusters based on their distances. Its specific merging strategy is heuristic, however, and is irreversible for the merged clusters. A hierarchy formed in this fashion often may not reflect the true structure in the data, since different hierarchical structures can be formed by choosing different distance metrics. The HDP algorithm, by contrast, captures the hierarchical structure at the model level. The merging is carried out automatically during the inference, so the hierarchy is naturally taken into consideration.
where a and b are some fixed constants. We assume that μ_{1} is conjugate to F. In this article, F is assumed to be the Gaussian distribution and μ_{1} the inverse Gamma distribution.
3 Inference algorithm
Under regularity conditions, the distribution of ${\left\{{z}_{\mathit{\text{ji}}}^{(l)}\right\}}_{j,i}$ will converge to the true posterior distribution in (5) [24]. The proposed Gibbs sampling algorithm is similar to the HDP inference algorithm proposed in [21], since both the Gibbs algorithms use the Chinese restaurant metaphor which we will elaborate later. However, because of the differences in modeling, we still need to provide details for the inference algorithm based on our model.
3.1 Chinese restaurant metaphor
The Chinese restaurant model [23] is a visualized characterization for interpreting the Dirichlet process. Because there is no explicit formula to describe the Dirichlet process, we will employ the Chinese restaurant model for HDP inference instead of directly computing the posterior distribution in (5). We refer to [23, 25] for the proof and other details of the equivalence between the Chinese restaurant metaphor and the Dirichlet processes.
where $\sum _{k}{d}_{\mathit{\text{jk}}}$ counts the number of tables occupied in the j th row and δ_{(·)} is the Kronecker delta function. The interpretation of (16) is that customer z_{ j i } chooses an already occupied table with probability proportional to the number of customers seated at it. In addition, z_{ j i } may choose a new table with probability proportional to α_{0}.
where ${\sum}_{j}{d}_{\mathit{\text{jk}}}$ counts the number of tables serving dish m_{ k }; ${\sum}_{\mathit{\text{jk}}}{d}_{\mathit{\text{jk}}}$ counts the total number of occupied tables; and K_{ j i } denotes the number of distinct dishes served before λ_{ j i } arrives, counting each dish only once even if it is served at multiple tables.
3.2 A Gibbs sampler for HDP inference
We can calculate the related conditional probabilities as follows.
The numerator of (20) is the joint density of the data generated by the same dish. By the assumption that ${g}_{{j}^{\prime}{i}^{\prime}}$ are conditionally independent given the chosen dish, the conditional density of the data takes a product form. The denominator is the joint density excluding the specific g_{ j i } term. The integrals in (20) can be calculated either numerically or by Monte Carlo integration. For example, to calculate the integral ${\int}_{a}^{b}f(x)p(x)\mathit{\text{dx}}$, where p(x) is a density function supported on [a,b], we can draw samples x_{1},x_{2},…,x_{ n } from p(x) and approximate the integral by ${\int}_{a}^{b}f(x)p(x)\mathit{\text{dx}}={E}_{p(x)}[f(x)]\approx \frac{1}{n}{\sum}_{i=1}^{n}f({x}_{i})$. To calculate (20), we view μ_{1}(·) as p(·) and $F({g}_{{j}^{\prime}{i}^{\prime}}\mid \xb7)$ as f(·).
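The Monte Carlo approximation above can be sketched in a few lines of Python (an illustration added here; the sample size n and the N(0,1)/x² example are our own choices, not from the article):

```python
import random

def mc_expectation(f, sampler, n=200_000, seed=0):
    """Approximate E_p[f(X)] = integral of f(x) p(x) dx by averaging
    f over n draws x_1, ..., x_n from p, as described in the text."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# Example: p = N(0, 1) and f(x) = x^2; the exact value is E[X^2] = 1.
est = mc_expectation(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
print(round(est, 2))  # close to the exact value 1
```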
The derivations of (19), (21), (22), and (23) are given in Appendix.
Before we present the Gibbs sampling algorithm, we recall the Metropolis–Hastings (M–H) algorithm [26] for drawing samples from a target distribution whose density function f(x) is known only up to a scaling factor, i.e., f(x)∝p(x). To draw samples from f(x), we make use of a fixed symmetric conditional distribution q(x_{2}∣x_{1}), i.e., one that satisfies q(x_{2}∣x_{1})=q(x_{1}∣x_{2}), ∀x_{1},x_{2}. The M–H algorithm proceeds as follows.

Start with an arbitrary value x_{0} with p(x_{0})>0.

For l=1,2,…
Given the previous sample x_{l−1}, draw a candidate sample x^{⋆} from q(x^{⋆}x_{l−1}).
Calculate $\beta =\frac{p({x}^{\star})}{p({x}_{l-1})}$. If β≥1, accept the candidate and let x_{ l }=x^{⋆}. Otherwise, accept it with probability β, or reject it and retain the previous sample, x_{ l }=x_{l−1}, with probability 1−β.
After a “burn-in” period, say l_{0}, the samples ${\left\{{x}_{l}\right\}}_{l>{l}_{0}}$ follow the distribution f(x).
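The steps above can be sketched as follows (an illustrative Python implementation; the Gaussian random-walk proposal and the unnormalized-normal target are our own example choices):

```python
import math
import random

def metropolis_hastings(p, x0, n, burn_in, step=1.0, seed=0):
    """M-H sampler with a symmetric Gaussian random-walk proposal;
    p is the target density known only up to a scaling factor."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for l in range(n):
        cand = x + rng.gauss(0.0, step)          # symmetric proposal q
        beta = p(cand) / p(x)
        if beta >= 1.0 or rng.random() < beta:   # accept w.p. min(1, beta)
            x = cand
        if l >= burn_in:                         # keep post burn-in samples
            samples.append(x)
    return samples

# Target: unnormalized standard normal density exp(-x^2 / 2).
s = metropolis_hastings(lambda x: math.exp(-x * x / 2.0), 0.0, 60_000, 10_000)
mean = sum(s) / len(s)
print(round(mean, 2))
```

The post burn-in samples should have mean near 0 and variance near 1, the moments of the standard normal target.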
We now summarize the Gibbs sampling algorithm for the HDP inference as follows.

Initialization: randomly assign the indices ${\mathit{\varphi}}^{(0)}=\left\{{\varphi}_{11}^{(0)},{\varphi}_{12}^{(0)},\dots \right\}$ and ${\mathit{\lambda}}^{(0)}=\left\{{\lambda}_{11}^{(0)},{\lambda}_{12}^{(0)},\dots \right\}$. Note that once we have all the indices, the counters {c_{ j i }} and {d_{ j k }} are also determined.

For l=1,2,…,l_{0}+L,
given by (19) and (21) using the M–H algorithm. We view the probability in (24) as the target density and choose q(·∣·) to be a distribution supported on $\mathbb{N}$. For example, we can use $q(i\mid j)=\frac{j}{{(j+1)}^{i}}$, $i,j\in \mathbb{N}$.
given by (22) and (23) using the M–H algorithm. We view the probability in (25) as the target density and use q(·∣·) as specified in the previous step.
Since P(α_{0}∣ϕ,λ,α_{1},g)=P(α_{0}) and P(α_{1}∣ϕ,λ,α_{0},g)=P(α_{1}), simply draw samples of ${\alpha}_{0}^{(l)}$ and ${\alpha}_{1}^{(l)}$ from their prior Gamma distributions.

Using the samples after the “burn-in” period ${\left\{{\mathit{\varphi}}^{(l)},{\mathit{\lambda}}^{(l)}\right\}}_{l={l}_{0}+1}^{{l}_{0}+L}$, calculate $\widehat{P}(\mathit{\varphi},\mathit{\lambda}\mid \mathbf{g})$, which is given by $\phantom{\rule{12.0pt}{0ex}}\widehat{P}\left({\varphi}_{\mathit{\text{ji}}}=a,{\lambda}_{j{\varphi}_{\mathit{\text{ji}}}}=b\mid \mathbf{g}\right)=\frac{1}{L}{\sum}_{l={l}_{0}+1}^{{l}_{0}+L}\mathbf{1}\left\{{\varphi}_{\mathit{\text{ji}}}^{(l)}=a,{\lambda}_{j{\varphi}_{\mathit{\text{ji}}}^{(l)}}^{(l)}=b\right\},$(26)

where 1(·) is the indicator function. Determine the membership distribution P(z∣g) from the inferred joint distribution $\widehat{P}(\mathit{\varphi},\mathit{\lambda}\mid \mathbf{g})$ by $P({z}_{\mathit{\text{ji}}}=a\mid \mathbf{g})=\sum _{b}\widehat{P}({\lambda}_{\mathit{\text{jb}}}=a\mid \mathbf{g},{\varphi}_{\mathit{\text{ji}}}=b)\widehat{P}({\varphi}_{\mathit{\text{ji}}}=b\mid \mathbf{g})$.

Calculate the estimated clustering index ${\widehat{z}}_{\xb7i}$ for the i th gene by ${\widehat{z}}_{\xb7i}=\underset{a}{\arg\max}\phantom{\rule{1pt}{0ex}}{\sum}_{j}P({z}_{\mathit{\text{ji}}}=a\mid \mathbf{g})$.
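The candidate indices in the sampling steps above are drawn from the proposal q(i∣j)=j/(j+1)^i. This is a geometric distribution with success probability j/(j+1), which can be sampled directly (an illustrative sketch, not part of the original algorithm description):

```python
import random

def sample_q(j, rng=random):
    """Draw i ~ q(i|j) = j / (j+1)^i, i = 1, 2, ...; this is a
    geometric distribution with success probability p = j / (j+1)."""
    p = j / (j + 1.0)
    i = 1
    while rng.random() >= p:  # count Bernoulli(p) trials to first success
        i += 1
    return i

# q(.|j) is properly normalized: sum_i j/(j+1)^i is a geometric series = 1.
j = 3
total = sum(j / (j + 1.0) ** i for i in range(1, 200))
print(round(total, 10))  # 1.0
```

Note that q(i∣j) ≠ q(j∣i) in general; a fully rigorous M–H acceptance ratio with this proposal would also include the proposal ratio q(x_{l−1}∣x^{⋆})/q(x^{⋆}∣x_{l−1}).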
3.3 A numerical example
In this section, we provide a simple numerical example to illustrate the proposed Gibbs sampler. Consider the case N=M=2, i.e., there are 2 genes and 2 experiments. Assume that the expressions are g_{11}=0, g_{12}=1, g_{21}=−1, and g_{22}=2. We assume ${\mu}_{1}(\theta )\sim \mathcal{N}(0,1)$ and $F({g}_{\mathit{\text{ji}}}\mid \theta )\sim \mathcal{N}(\theta ,1)$. For initialization, we set ${\varphi}_{11}^{(0)}=1,{\varphi}_{12}^{(0)}=2,{\varphi}_{21}^{(0)}=3,{\varphi}_{22}^{(0)}=4$; ${\lambda}_{1{\varphi}_{11}^{(0)}}^{(0)}=1,{\lambda}_{1{\varphi}_{12}^{(0)}}^{(0)}=1,{\lambda}_{2{\varphi}_{21}^{(0)}}^{(0)}=2,{\lambda}_{2{\varphi}_{22}^{(0)}}^{(0)}=2,$ and ${\alpha}_{0}^{(0)}={\alpha}_{1}^{(0)}=1$.
Note that the above integral can be calculated either numerically or by using the Monte Carlo integration method.
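For the Gaussian model in this example, the prior-predictive integral $\int F({g}_{11}\mid \theta ){\mu}_{1}(\theta )\mathrm{d}\theta$ also has a closed form: with F(·∣θ)=N(θ,1) and μ_{1}=N(0,1), the marginal is N(g_{11}∣0,2), which at g_{11}=0 gives 1/√(4π)≈0.28209, consistent with the ≈0.28208 used below. The following sketch (added for illustration) checks the Monte Carlo estimate against the exact value:

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

g11 = 0.0

# Exact: \int N(g|theta,1) N(theta|0,1) dtheta = N(g | 0, 2).
exact = normal_pdf(g11, 0.0, 2.0)

# Monte Carlo: draw theta ~ mu_1 = N(0,1) and average F(g11 | theta).
rng = random.Random(0)
mc = sum(normal_pdf(g11, rng.gauss(0.0, 1.0), 1.0) for _ in range(200_000)) / 200_000

print(round(exact, 5))  # 0.28209
print(round(mc, 5))     # agrees with the exact value to ~3 decimals
```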
Since $\beta =\frac{0.1483}{0.22971}\approx 0.6456<1$, we accept the candidate sample ϕ_{11}=3 with probability 0.6456. After the burn-in period, suppose the sample returned by the M–H algorithm is ϕ_{11}=4; then we update ${\varphi}_{11}^{(1)}=4$ and move on to draw samples of the remaining variables ϕ_{12}, ϕ_{21}, and ϕ_{22}.
Assume that we obtain the samples of ϕ^{(1)} as ${\varphi}_{11}^{(1)}=4,{\varphi}_{12}^{(1)}=1,{\varphi}_{21}^{(1)}=1,{\varphi}_{22}^{(1)}=2$. We next draw the sample λ^{(1)}. Given the initial value ${\lambda}_{1{\varphi}_{11}^{(1)}}=1$, suppose q(·∣·) returns ${\lambda}_{1{\varphi}_{11}^{(1)}}=3$ as a candidate sample. By (22), we obtain $P\left({\lambda}_{1{\varphi}_{11}^{(1)}}^{(1)}=1\mid {\mathit{\varphi}}^{(1)},{\mathit{\lambda}}_{1{\varphi}_{11}^{(1)}}^{(0)c},{\alpha}_{1}^{(0)},{\alpha}_{0}^{(0)},\mathbf{g}\right)\propto \left(\sum _{j}{d}_{j1}\right){f}_{1}\left({g}_{11}\mid {\mathbf{g}}_{11}^{c}\right)$. Furthermore, we have ${\sum}_{j}{d}_{j1}=2$ and ${f}_{1}\left({g}_{11}\mid {\mathbf{g}}_{11}^{c}\right)\approx 0.22971$ as calculated before.
By (23), we obtain $P\left({\lambda}_{1{\varphi}_{11}}^{(1)}=3\mid {\mathit{\varphi}}^{(1)},{\mathit{\lambda}}_{1{\varphi}_{11}}^{(0)c},{\alpha}_{1}^{(0)},{\alpha}_{0}^{(0)},\mathbf{g}\right)\propto {\alpha}_{1}\int F({g}_{11}\mid \theta ){\mu}_{1}(\theta )\mathrm{d}\theta$. Moreover, we have α_{1}=1 and $\int F({g}_{11}\mid \theta ){\mu}_{1}(\theta )\mathrm{d}\theta \approx 0.28208$ as calculated before. So we have $\beta =\frac{0.28208}{2\times 0.22971}\approx 0.614<1$. After the burn-in period, assume that the M–H algorithm returns a sample ${\lambda}_{1{\varphi}_{11}^{(1)}}=2$; then update ${\lambda}_{1{\varphi}_{11}^{(1)}}^{(1)}=2$ and move on to sample the remaining λ variables as well as α_{0} and α_{1}.
After the burn-in period of the whole Gibbs sampler, we can calculate the posterior joint distribution P(ϕ,λ∣g) from the samples and determine the clusters following the last two steps of the proposed Gibbs sampling algorithm.
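These last two steps, the empirical joint estimate (26) and the membership marginalization, amount to counting sample frequencies. A minimal sketch, using a hypothetical list of post burn-in (ϕ, λ) sample pairs (the numbers below are illustrative, not from the example above):

```python
from collections import Counter

def estimate_joint(samples):
    """Empirical joint distribution as in (26): the frequency of each
    (phi, lambda) pair among the post burn-in Gibbs samples."""
    L = len(samples)
    return {pair: c / L for pair, c in Counter(samples).items()}

def membership_prob(p_hat, a):
    """P(z = a | g): marginalize the joint over the table index b,
    keeping only pairs whose dish index (cluster label) equals a."""
    return sum(p for (b, lam), p in p_hat.items() if lam == a)

# Hypothetical post burn-in samples of the pair (phi_ji, lambda_{j,phi_ji}).
samples = [(1, 2), (1, 2), (2, 2), (1, 1), (1, 2)]
p_hat = estimate_joint(samples)
# P(phi=1, lambda=2) = 3/5 and P(z=2) = 4/5 for these samples.
print(p_hat[(1, 2)], membership_prob(p_hat, 2))
```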
4 Experimental results
The HDP clustering algorithm proposed in this article can be employed for gene expression analysis or as a segmentation algorithm for gene regulatory network inference. In this section, we first introduce two performance measures for clustering, the Rand index (RI) [27] and the Silhouette index (SI) [28]. We compare the HDP algorithm to the support vector machine (SVM) algorithm for network segmentation on synthetic data. We then conduct various experiments on both synthetic and real datasets, including the AD400 dataset [29], the yeast galactose dataset [30], the yeast sporulation dataset [31], the human fibroblasts serum dataset [32], and the yeast cell cycle data [33]. We compare the HDP algorithm to latent Dirichlet allocation (LDA), MCLUST, SVM, K-means, Bayesian infinite mixture clustering (BIMC), and HC [4, 14, 34–37] based on the performance measures and the functional relationships.
4.1 Performance measures
In order to evaluate the clustering result, we utilize two measures: RI [27] and SI [28]. The first index is used when a ground truth is known a priori, and the second measures the performance without any knowledge of the ground truth.
The RI is a measure of agreement between two clustering results. It takes a value between 0 and 1; the higher the score, the better the agreement.
The SI is the average silhouette value over all data points. The value of SI lies in [−1,1], and a higher score indicates better performance.
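Both indices are straightforward to compute from their definitions. The following sketch (our own illustration; the SI version uses 1-D points and Euclidean distance for simplicity) implements pair-counting RI and the average silhouette:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """RI: fraction of point pairs on which two clusterings agree
    (the pair is grouped together in both, or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

def silhouette_index(points, labels):
    """Average silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    using |x - y| as the distance; every cluster needs >= 2 points."""
    n = len(points)
    scores = []
    for i in range(n):
        own = [abs(points[i] - points[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        a = sum(own) / len(own)   # mean distance within own cluster
        b = min(                  # smallest mean distance to another cluster
            sum(abs(points[i] - points[j]) for j in range(n) if labels[j] == l)
            / labels.count(l)
            for l in set(labels) if l != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 2 of 6 pairs agree: 1/3
print(silhouette_index([0.0, 0.1, 10.0, 10.1], [0, 0, 1, 1]))
```

Well-separated, tight clusters yield an SI near 1, as in the second example.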
4.2 Network segmentation on synthetic data
In regulatory network inference, due to the large size of the network, it is often useful to perform a network segmentation. The segmented subnetworks usually have far fewer nodes than the original network, leading to faster and more accurate analysis of the original network [38]. Clustering algorithms can be employed for this segmentation purpose. However, traditional clustering algorithms often provide segmentation results that are either too fine or too coarse, i.e., the resulting subnetworks contain either too few or too many genes. In addition, the hierarchical structure of the network cannot be discovered by those algorithms. Thanks to its hierarchical model assumption, the HDP algorithm can provide better segmentation results. We demonstrate the segmentation application of HDP on a synthetic network and compare it to the SVM algorithm, which is widely used for clustering and segmentation.
4.3 AD400 data
AD400 is a synthetic dataset proposed in [29], which is used to evaluate clustering algorithm performance. The dataset consists of 400 genes measured at 10 time points. As the ground truth, the AD400 dataset has 10 clusters, each containing 40 genes.
Clustering performance of LDA, SVM, MCLUST, K-means, HC, BIMC, and HDP on the AD400 data
Algorithm  RI  SI  Number of clusters 

LDA  0.931  0.553  10.0 
SVM  0.929  0.493  11 
MCLUST  0.942  0.583  10 
K-means  0.895  0.457  10 
HC  0.916  0.348  9 
BIMC  0.935  0.571  10.0 
HDP  0.947  0.577  10.0 
4.4 Yeast galactose data
Clustering performance of LDA, MCLUST, SVM, and HDP on the yeast galactose data
Algorithm  Rand index  Number of clusters 

LDA  0.942  6.3 
SVM  0.954  5 
MCLUST  0.903  9 
HDP  0.973  3.8 
It is seen that the HDP algorithm performs the best among the four algorithms. Unlike the MCLUST and LDA algorithms, which produce more than 4 clusters, the average number of clusters given by the HDP algorithm is very close to the “true” value 4. Compared to the SVM method, the HDP algorithm produces a result that is more similar to the “ground truth”, i.e., with the highest RI value.
4.5 Yeast sporulation data
Clustering performance of LDA, MCLUST, Kmeans, HC, BIMC, and HDP on the yeast sporulation data
Algorithm  SI  Number of clusters 

LDA  0.586  6.2 
MCLUST  0.577  6 
K-means  0.324  8 
HC  0.392  7 
BIMC  0.592  6.1 
HDP  0.673  6.0 
From Table 3, we can see that HDP has the highest SI score. It suggests that the clustering results provided by HDP are more compact and better separated than the results from the other algorithms. The K-means and HC algorithms suggest a higher number of clusters; however, their SI scores indicate that their clusters are not as tight as those of the other algorithms.
4.6 Human fibroblasts serum data
The human fibroblasts serum data consist of 8,613 genes measured at 12 time points [32]. Again, a logarithmic transform has been applied to the data and genes without significant changes have been removed. The remaining dataset contains 532 genes.
Clustering performance of LDA, MCLUST, Kmeans, HC, BIMC, and HDP on the human fibroblasts serum data
Algorithm  SI  Number of clusters 

LDA  0.298  9.4 
MCLUST  0.382  6 
K-means  0.324  7 
HC  0.313  5 
BIMC  0.418  7.3 
HDP  0.452  6.4 
4.7 Yeast cell cycle data
Numbers of newly discovered genes in various functional categories by the proposed HDP clustering algorithm
Function categories  Number of newly discovered genes 

Cell cycle and DNA processing  20 
Protein synthesis  25 
Protein fate  4 
Cell fate  12 
Transcription  8 
Unclassified protein  57 
Note that in [14] a Bayesian model with an infinite number of clusters is proposed based on the Dirichlet process. The model in [14] is a special case of the HDP model proposed in this article when there is only one level of hierarchy. In terms of discovering new gene functionalities, we find that the performances of the two algorithms are similar, as the method in [14] discovered 106 new genes compared to the result in [2]. However, by taking the hierarchical structure into account, the total number of clusters found by the HDP algorithm is significantly smaller than the 43 clusters given in [14]. The SI scores for BIMC and HDP are 0.321 and 0.392, respectively. The HDP clustering consolidates many fragmental clusters, which may make the clustering results easier to interpret.
List of newly discovered genes in various functional categories
Function categories  Genes 

YBL051c YBR136w YBL016w YDR200c YBR274w  
Cell cycle and DNA  YDR217c YLR314c YJL074c YJL095w YDR052c 
processing  YDL126c YCL016c YDL188c YAL040c YEL019c 
YER122c YLR035c YLR055c YML032c YMR078c  
Protein synthesis  YDR091c YGL103w YBR118w YBL057c YBR101c 
YBR181c YDL083c YDL184c YDR012w YDR172w  
YGL105w YGL129c YJL041w YJL125c YJR113c  
YLR185w YPL037c YPL048w YLR009w YHL001w  
YHL015w YHR011w YHR088w YDR450w YEL034w  
Protein fate  YAL016w YBL009w YBR044c YDL040c 
Cell fate  YAL040c YDL006w YDL134c YIL007c YJL187c 
YDL029w YDL035c YCR002c YBL105c YCR089w  
YER114c YEL023c  
Transcription  YAL021c YBL022c YCL051w YDR146c YIL084c 
YJL127c YJL164c YJL006c 
5 Conclusions
In this article, we have proposed a new clustering approach based on the HDP. The HDP clustering explicitly models the hierarchical structure in the data that is prevalent in biological data such as gene expressions. We have developed a statistical inference algorithm for the proposed HDP model based on the Chinese restaurant metaphor and the Gibbs sampler. We have applied the proposed HDP clustering algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to reveal more structural information of the data compared to popular algorithms such as SVM and MCLUST, by incorporating the hierarchical knowledge into the model.
Appendix
Derivation of formulas (19) and (21)
Combining (35) and (37), we have (19).
Combining (36) and (39), we have (21).
Derivation of (22) and (23)
Combining (43), (44), (45), and (46), we have (22) and (23).
References
 Schena M, Shalon D, Davis R, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995,270(5235):467470. 10.1126/science.270.5235.467View Article
 Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D: A genomewide transcriptional analysis of the mitotic cell cycle. Mol. Cell 1998, 2: 6573. 10.1016/S10972765(00)801148View Article
 Hughes J, Estep P, Tavazoie S, Church G: Computational identification of cisregulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol 2000,296(5):12051214. 10.1006/jmbi.2000.3519View Article
 Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 1998, 95(25):14863-14868. 10.1073/pnas.95.25.14863
 MacQueen J: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. California: University of California Press; 1967:281-297.
 Kohonen T: Self-Organization and Associative Memory. New York: Springer; 1988.
 Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: a survey. IEEE Trans. Knowledge Data Eng. 2004, 16(11):1370-1386. 10.1109/TKDE.2004.68
 Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological) 1977, 39: 1-38.
 McLachlan G, Peel D: Finite Mixture Models. New York: Wiley-Interscience; 2000.
 Fraley C, Raftery A: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002, 97(458):611-631. 10.1198/016214502760047131
 Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Model-based clustering and data transformations for gene expression data. Bioinformatics 2001, 17(10):977-987. 10.1093/bioinformatics/17.10.977
 Schwarz G: Estimating the dimension of a model. Ann. Stat. 1978, 6(2):461-464. 10.1214/aos/1176344136
 Akaike H: A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19(6):716-723. 10.1109/TAC.1974.1100705
 Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18(9):1194-1206. 10.1093/bioinformatics/18.9.1194
 Ferguson T: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973, 1(2):209-230. 10.1214/aos/1176342360
 Neal R: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 2000, 9(2):249-265.
 Pitman J: Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes-Monograph Series 1996, 245-267.
 Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Online Library; 1990.
 Jiang D, Pei J, Zhang A: DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings of Third IEEE Symposium on Bioinformatics and Bioengineering. Bethesda: IEEE; 2003:393-400.
 Piatigorsky J: Gene Sharing and Evolution: The Diversity of Protein Functions. Cambridge: Harvard University Press; 2007.
 Teh Y, Jordan M, Beal M, Blei D: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 2006, 101(476):1566-1581. 10.1198/016214506000000302
 Sethuraman J: A constructive definition of Dirichlet priors. Stat. Sinica 1994, 4: 639-650.
 Aldous D: Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII 1985, 1-198.
 Casella G, George E: Explaining the Gibbs sampler. Am. Stat. 1992, 46(3):167-174.
 Blackwell D, MacQueen J: Ferguson distributions via Pólya urn schemes. Ann. Stat. 1973, 1(2):353-355. 10.1214/aos/1176342372
 Brooks S: Markov chain Monte Carlo method and its application. J. R. Stat. Soc. Ser. D (The Statistician) 1998, 47: 69-100. 10.1111/1467-9884.00117
 Hubert L, Arabie P: Comparing partitions. J. Classif. 1985, 2: 193-218. 10.1007/BF01908075
 Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20: 53-65.
 Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17(9):763-774. 10.1093/bioinformatics/17.9.763
 Yeung K, Medvedovic M, Bumgarner R: Clustering gene-expression data with repeated measurements. Genome Biol. 2003, 4(5):R34. 10.1186/gb-2003-4-5-r34
 Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282(5389):699-705.
 Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J, Boguski MS: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283(5398):83-87. 10.1126/science.283.5398.83
 Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9(12):3273.
 Blei D, Ng A, Jordan M: Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3: 993-1022.
 Fraley C, Raftery A: MCLUST: software for model-based cluster analysis. J. Classif. 1999, 16(2):297-306. 10.1007/s003579900058
 Furey T, Cristianini N, Duffy N, Bednarski D, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906-914. 10.1093/bioinformatics/16.10.906
 Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat. Genetics 1999, 22: 281-285. 10.1038/10343
 Chung F, Lu L: Complex Graphs and Networks. CBMS Lecture Series no. 107. Providence: American Mathematical Society; 2006.
 Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J: Gene ontology: tool for the unification of biology. Nat. Genet. 2000, 25: 25-29. 10.1038/75556
 Stanford University: Yeast cell cycle datasets http://genome-www.stanford.edu/cellcycle/data/rawdata
 Lukashin A, Fuchs R: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001, 17(5):405-414. 10.1093/bioinformatics/17.5.405
 Mewes H, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 2002, 30: 31-34. 10.1093/nar/30.1.31
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.