As in any model-based clustering method, it is assumed that the gene expression data are random samples from some underlying distributions, and that all data in one cluster are generated by the same distribution. In most existing clustering algorithms, each gene is associated with a vector containing its expressions in all experiments, and the clustering of the genes is based on these vectors. However, such an approach ignores the fact that genes may show different functionalities under various experimental conditions, i.e., different clusters may be formed under different experiments. To cope with this phenomenon, we treat each expression separately. More specifically, we allow different expressions of the same gene to be generated by different statistical models.

Suppose that for the microarray data, there are *N* genes in total. For each gene, we conduct *M* experiments. Let *g*_{ji} denote the expression of the *i*th gene in the *j*th experiment, 1≤*i*≤*N* and 1≤*j*≤*M*. For each *g*_{ji}, we associate a latent membership variable *z*_{ji}, which indicates the cluster membership of *g*_{ji}. That is, if genes *i* and *i*^{′} are in the same cluster under the conditions of experiments *j* and *j*^{′}, we have ${z}_{\mathit{\text{ji}}}={z}_{{j}^{\prime}{i}^{\prime}}$. Note that *z*_{ji} is supported on a countable set such as $\mathbb{N}$ or $\mathbb{Z}$. For each *g*_{ji}, we associate a coefficient ${\theta}_{{z}_{\mathit{\text{ji}}}}$, whose index is determined by its membership variable *z*_{ji}. To follow a Bayesian approach, we also assume that each coefficient *θ*_{k} is drawn independently from a prior distribution *G*_{0}:

$\begin{array}{l}{\theta}_{k}\sim {G}_{0},\end{array}$

(1)

where *k* is determined by *z*_{ji}.

The membership variable **z**={*z*_{ji}}_{j,i} has a discrete joint distribution

$\begin{array}{l}\mathbf{z}\sim \Pi .\end{array}$

(2)

Note that in this article, a bold-face letter always refers to the set formed by the elements with the specified indices.

We assume that each *g*_{ji} is drawn independently from a distribution $F({\theta}_{{z}_{\mathit{\text{ji}}}})$:

$\begin{array}{l}{g}_{\mathit{\text{ji}}}\sim F\left({\theta}_{{z}_{\mathit{\text{ji}}}}\right),\end{array}$

(3)

where ${\theta}_{{z}_{\mathit{\text{ji}}}}$ is the coefficient associated with *g*_{ji} and *F* is a distribution family such as the Gaussian distribution family. In summary, we have the following model for the expression data:

$\begin{array}{cc}{\theta}_{k}& \sim {G}_{0}\\ \mathbf{z}& \sim \Pi \\ {g}_{\mathit{\text{ji}}}|{z}_{\mathit{\text{ji}}},{\theta}_{k}& \sim F\left({\theta}_{{z}_{\mathit{\text{ji}}}}\right).\end{array}$

(4)

The above model is relatively general and subsumes many previous models. For example, in all Bayesian approaches, all variables are assigned proper priors. It is popular to use a mixture model as the prior, which models the data as generated by a mixture of distributions, e.g., a convex combination of a family of distributions such as Gaussian distributions. Each cluster is generated by one component of the mixture distribution given the membership variable [14]. This approach corresponds to our model if we assume that *Π* is finitely supported and *F* is Gaussian.
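As a concrete illustration of this finite-mixture special case, the generative process — draw a membership variable from a finitely supported *Π*, then draw the expression from the corresponding Gaussian component — can be sketched as follows. The component weights, means, and standard deviations below are illustrative assumptions, not values from the article.

```python
import random

def sample_finite_mixture(n, weights, means, sds, seed=0):
    """Draw n expressions from a finite Gaussian mixture:
    z ~ Categorical(weights), then g | z ~ Normal(means[z], sds[z])."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]  # membership variable
        g = rng.gauss(means[z], sds[z])                           # expression value
        samples.append((z, g))
    return samples

# Illustrative two-component mixture
samples = sample_finite_mixture(5, weights=[0.7, 0.3], means=[0.0, 5.0], sds=[1.0, 1.0])
```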

The aim of clustering is to determine the posterior probability of the latent membership variables given the observed gene expressions

$\begin{array}{l}P(\mathbf{z}|\mathbf{g}),\end{array}$

(5)

where **g**={*g*_{ji}}_{j,i}.

As a clustering algorithm, the final result is given in the form of clusters, and each gene has to be assigned to one and only one cluster. Once we have the inference result in (5), we can apply the maximum *a posteriori* criterion to obtain an estimate ${\widehat{z}}_{\cdot i}$ of the membership variable for the *i*th gene as

$\begin{array}{l}{\widehat{z}}_{\cdot i}=\underset{a}{\arg \max}\sum _{j}P({z}_{\mathit{\text{ji}}}=a|\mathbf{g}).\end{array}$

(6)

We note that if one is interested in finding other clusters related to a gene, one can simply use the inferred distribution of its membership variable to obtain this information.
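A minimal sketch of the MAP rule in (6), assuming the inferred posteriors *P*(*z*_{ji}=*a*|**g**) for one gene are available as a list of per-experiment rows; the numbers below are made up for illustration.

```python
def map_cluster(posterior):
    """posterior[j][a] = P(z_ji = a | g) for a fixed gene i.
    Returns the MAP label: argmax over a of sum_j posterior[j][a]."""
    n_labels = len(posterior[0])
    totals = [sum(row[a] for row in posterior) for a in range(n_labels)]
    return max(range(n_labels), key=totals.__getitem__)

# Two experiments, three candidate cluster labels (illustrative posteriors)
post = [[0.2, 0.5, 0.3],
        [0.1, 0.6, 0.3]]
label = map_cluster(post)  # label 1 has the largest summed posterior
```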

### 2.1 Dirichlet processes and infinite mixture model

Instead of assuming a fixed number of clusters *a priori*, one can assume an infinite number of clusters, which avoids the problem of estimating the number of clusters mentioned earlier. Correspondingly, in (4), the prior *Π* is an infinite discrete distribution. Again, in the Bayesian fashion, we introduce priors for all parameters. The Dirichlet process is one such prior. It can be viewed as a random measure [15], i.e., a random variable whose samples are themselves probability measures. In this section, we give a brief introduction to the Dirichlet process, which serves as the key prior in our HDP model.

Recall that the Dirichlet distribution $\mathcal{D}({u}_{1},\dots ,{u}_{K})$ of order *K* on the (*K*−1)-simplex in ${\mathbb{R}}^{K-1}$ with parameters *u*_{1},…,*u*_{K} is given by the following probability density function

$\mathcal{D}({x}_{1},\dots ,{x}_{K-1};{u}_{1},\dots ,{u}_{K})=\frac{\Gamma \left({\sum}_{i=1}^{K}{u}_{i}\right)}{{\prod}_{i=1}^{K}\Gamma ({u}_{i})}\prod _{i=1}^{K}{{x}_{i}}^{{u}_{i}-1}$

(7)

where ${\sum}_{i=1}^{K}{x}_{i}=1$, ${u}_{i}>0$ for $i=1,\dots ,K$, and *Γ*(·) is the Gamma function. Since every point in its domain is a discrete probability measure on *K* outcomes, the Dirichlet distribution is a random measure on the space of finite discrete probability measures.
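A draw from the density in (7) can be generated with the standard Gamma-normalization construction: sample independent Gamma(*u*_{i}, 1) variates and normalize them. A sketch, with illustrative parameter values:

```python
import random

def sample_dirichlet(u, seed=0):
    """Draw (x_1, ..., x_K) ~ D(u_1, ..., u_K) by normalizing
    independent Gamma(u_i, 1) random variables."""
    rng = random.Random(seed)
    y = [rng.gammavariate(ui, 1.0) for ui in u]
    total = sum(y)
    return [yi / total for yi in y]

x = sample_dirichlet([1.0, 2.0, 3.0])
# x lies on the 2-simplex: nonnegative entries summing to one
```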

The Dirichlet process generalizes the Dirichlet distribution to continuous spaces. There are various constructive and non-constructive definitions of Dirichlet processes. For simplicity, we use the following non-constructive definition.

Let (*X*, *σ*, *μ*_{0}) be a probability space. A Dirichlet process *D*(*α*_{0},*μ*_{0}) with parameter *α*_{0}>0 is defined as a random measure: for any non-trivial finite partition (*χ*_{1},…,*χ*_{r}) of *X* with *χ*_{i}∈*σ*, we have the random vector

$(\mathcal{G}({\chi}_{1}),\dots ,\mathcal{G}({\chi}_{r}))\sim \mathcal{D}({\alpha}_{0}{\mu}_{0}({\chi}_{1}),\dots ,{\alpha}_{0}{\mu}_{0}({\chi}_{r})),$

(8)

where $\mathcal{G}$ is drawn from *D*(*α*_{0},*μ*_{0}).

The Dirichlet process can be characterized in various ways [15], such as the stick-breaking construction [22] and the Chinese restaurant process [23]. The Chinese restaurant process serves as an intuitive, visual characterization of the Dirichlet process.

Let *x*_{1},*x*_{2},… be a sequence of random variables drawn from the Dirichlet process *D*(*α*_{0},*μ*_{0}). Although we do not have an explicit formula for *D*, we would like to know the conditional probability of *x*_{i} given *x*_{1},…,*x*_{i−1}. In the Chinese restaurant model, the data can be viewed as customers sequentially entering a restaurant with an infinite number of tables. Each table corresponds to a cluster with unlimited capacity. Each customer *x*_{i} entering the restaurant joins an already occupied table with probability proportional to the number of customers seated at it. Alternatively, the new customer may sit at a new table with probability proportional to *α*_{0}. Tables that have already been occupied by customers thus tend to gain more and more customers.
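A common formalization of this seating rule: after *i* customers have been seated, the next one joins occupied table *t* with probability *n*_{t}/(*i*+*α*_{0}) and opens a new table with probability *α*_{0}/(*i*+*α*_{0}). A minimal simulation (the customer count and *α*_{0} below are arbitrary choices):

```python
import random

def chinese_restaurant_process(n_customers, alpha0, seed=0):
    """Sequentially seat customers; returns the table index of each customer.
    An occupied table is chosen with probability proportional to its occupancy,
    a new table with probability proportional to alpha0."""
    rng = random.Random(seed)
    occupancy = []   # occupancy[t] = number of customers at table t
    seating = []
    for _ in range(n_customers):
        weights = occupancy + [alpha0]   # existing tables, plus mass for a new one
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(occupancy):
            occupancy.append(1)          # open a new table
        else:
            occupancy[t] += 1
        seating.append(t)
    return seating

seating = chinese_restaurant_process(100, alpha0=1.0)
```

The rich-get-richer behavior is visible in the simulation: early tables accumulate most of the customers, which is exactly the clustering effect exploited by the model.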

One remarkable property of the Dirichlet process is that even when its base measure is continuous, its draws are almost surely discrete (supported on countably many atoms) [15]. In other words, almost every sample distribution drawn from the Dirichlet process is a discrete distribution. As a consequence, the Dirichlet process is well suited to serve as a non-parametric prior for the infinite mixture model.

The Dirichlet mixture model uses the Dirichlet process as a prior. The model in (4) can then be represented as follows:

$\begin{array}{c}{g}_{\mathit{\text{ji}}}|{z}_{\mathit{\text{ji}}},{\theta}_{k}\sim F({\theta}_{{z}_{\mathit{\text{ji}}}});\end{array}$

(9)

*θ*_{k} is generated by the base measure *μ*_{0}:

$\begin{array}{c}{\theta}_{k}\sim {\mu}_{0};\end{array}$

(10)

{*z*_{ji}} is generated by a Dirichlet process *D*(*α*_{0},*μ*_{0}):

$\begin{array}{c}\left\{{z}_{\mathit{\text{ji}}}\right\}\sim D({\alpha}_{0},{\mu}_{0}).\end{array}$

(11)

Recall that a draw from *D*(*α*_{0},*μ*_{0}) is discrete almost surely; its atoms correspond to the indices of the clusters.
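This almost-sure discreteness can be made concrete via the stick-breaking construction [22]: weights *w*_{k}=*v*_{k}∏_{l<k}(1−*v*_{l}) with *v*_{k}∼Beta(1,*α*_{0}), paired with atoms drawn i.i.d. from *μ*_{0}. A truncated sketch, with a standard Gaussian standing in for *μ*_{0} and an arbitrary truncation level:

```python
import random

def stick_breaking_dp(alpha0, base_sampler, truncation=50, seed=0):
    """Approximate draw G = sum_k w_k * delta_{theta_k} from D(alpha0, mu0),
    truncated to a fixed number of atoms."""
    rng = random.Random(seed)
    weights, atoms, remaining = [], [], 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha0)   # stick-breaking proportion
        weights.append(remaining * v)      # break off a piece of the stick
        atoms.append(base_sampler(rng))    # theta_k ~ mu0
        remaining *= 1.0 - v
    return weights, atoms

# Illustrative base measure mu0 = standard Gaussian
w, theta = stick_breaking_dp(alpha0=1.0, base_sampler=lambda r: r.gauss(0.0, 1.0))
# sum(w) approaches 1 as the truncation level grows
```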

### 2.2 HDP model

Biological data such as expression data often exhibit hierarchical structures. For example, although clusters can be formed based on similarities, some clusters may still share certain similarities among themselves at different levels. Within one cluster, the genes may share similar features; at the level of clusters, one cluster may share some similar features with other clusters. Many traditional clustering algorithms fail to recognize such hierarchical information and are not able to group these similar clusters into a new cluster, producing many fragments in the final clustering result. As a consequence, it is difficult to interpret the functionalities and meanings of these fragments. Therefore, it is desirable to have an algorithm that is able to cluster among clusters. In other words, the algorithm should be able to cluster based on multiple features at different levels. In order to capture the hierarchical structure of the gene expressions, we now introduce the hierarchical model to allow clustering at different levels. The clustering algorithm based on the hierarchical model not only reduces the number of cluster fragments, but may also reveal more details about the unknown functionalities of certain genes, since clusters can share multiple features.

Recall that in the statistical model (11), the clustering effect is induced by the Dirichlet process *D*(*α*_{0},*μ*_{0}). If we need to take different levels of clusters into account, it is natural to introduce a prior with a clustering effect on the base measure *μ*_{0}. Again, the Dirichlet process can serve as such a prior. The intuition is that, given the base measure, the clustering effect at the single-gene level is represented through a Dirichlet process; placing a Dirichlet process prior on the base measure makes the base measure itself exhibit a clustering effect, which leads to clustering at the cluster level. We simply set the prior on the base measure *μ*_{0} as

$\begin{array}{c}{\mu}_{0}\sim {D}_{1}({\alpha}_{1},{\mu}_{1}),\end{array}$

(12)

where *D*_{1}(*α*_{1},*μ*_{1}) is another Dirichlet process. In this article, we use the same letter for a measure, the distribution it induces, and the corresponding density function whenever the meaning is clear from the context. Moreover, we could extend the hierarchy to as many levels as we wish, at the expense of a more complex inference algorithm. The desired number of levels can be determined from prior biological knowledge. In this article, we focus on a two-level hierarchy.

As a remark, we would like to point out the connection and difference between the “hierarchy” in the proposed HDP method and that in traditional HC [4]. Both the HDP and HC algorithms can provide hierarchical clustering results. The hierarchy in the HDP method is manifested by the Chinese restaurant process, which will be introduced later: data sitting at the same table can be viewed as the first level, and all tables sharing the same dish can be viewed as the second level. The hierarchy in HC, by contrast, is obtained by merging existing clusters based on their distances. This merging strategy is heuristic, and the merges are irreversible. A hierarchy formed in this fashion often may not reflect the true structure in the data, since different distance metrics can produce different hierarchical structures. The HDP algorithm, however, captures the hierarchical structure at the model level: the merging is carried out automatically during inference, so the hierarchy is naturally taken into consideration.

In summary, we have the following HDP model for the data:

$\begin{array}{rcl}{\mu}_{0}& \sim & {D}_{1}({\alpha}_{1},{\mu}_{1})\\ \left\{{z}_{\mathit{\text{ji}}}\right\}|{\mu}_{0},{\alpha}_{0}& \sim & D({\alpha}_{0},{\mu}_{0})\\ {\alpha}_{0},{\alpha}_{1}& \sim & \Gamma (a,b)\\ {\theta}_{k}& \sim & {\mu}_{1}\\ {g}_{\mathit{\text{ji}}}|{z}_{\mathit{\text{ji}}},{\theta}_{k}& \sim & F\left({\theta}_{{z}_{\mathit{\text{ji}}}}\right),\end{array}$

(13)

where *a* and *b* are some fixed constants. We assume that *μ*_{1} is a conjugate prior for *F*. In this article, *F* is assumed to be the Gaussian distribution and *μ*_{1} the inverse Gamma distribution.
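Under a finite truncation, the two-level structure of (13) admits a simple generative sketch: global weights over *K* shared atoms come from a stick-breaking draw with concentration *α*_{1}, and each experiment's weights are drawn as Dirichlet(*α*_{0}·*β*), a standard finite approximation to a Dirichlet process with a discrete base measure. Everything below (the truncation level *K*, the concentration values, the group sizes) is an illustrative assumption, not part of the article's inference procedure:

```python
import random

def dirichlet_draw(rng, u):
    """Dirichlet sample via normalized Gamma variates (tiny floor for stability)."""
    y = [rng.gammavariate(max(ui, 1e-12), 1.0) for ui in u]
    total = sum(y)
    return [yi / total for yi in y]

def hdp_memberships(n_groups, n_per_group, alpha1, alpha0, K=20, seed=0):
    """Truncated sketch of model (13): global weights beta from stick-breaking
    with concentration alpha1 over K shared atoms; per-group (per-experiment)
    weights pi_j ~ Dirichlet(alpha0 * beta); memberships z_ji ~ Categorical(pi_j).
    Atom indices are shared across groups, giving clustering at both levels."""
    rng = random.Random(seed)
    beta, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, alpha1)
        beta.append(remaining * v)
        remaining *= 1.0 - v
    beta.append(remaining)               # residual mass on the last atom
    z = []
    for _ in range(n_groups):
        pi = dirichlet_draw(rng, [alpha0 * b for b in beta])
        z.append([rng.choices(range(K), weights=pi)[0] for _ in range(n_per_group)])
    return z

z = hdp_memberships(n_groups=3, n_per_group=10, alpha1=1.0, alpha0=1.0)
```

Because every group draws its weights over the same global atoms, the same cluster index can appear in several experiments, which is exactly the sharing across conditions that the HDP model is designed to capture.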