Using Temporal Correlation in Factor Analysis for Reconstructing Transcription Factor Activities
© I. Pournara and L. Wernisch. 2008
Received: 1 October 2007
Accepted: 13 April 2008
Published: 17 April 2008
Skip to main content
© I. Pournara and L. Wernisch. 2008
Received: 1 October 2007
Accepted: 13 April 2008
Published: 17 April 2008
Two-level gene regulatory networks consist of the transcription factors (TFs) in the top level and their regulated genes in the second level. The expression profiles of the regulated genes are the observed high-throughput data given by experiments such as microarrays. The activity profiles of the TFs are treated as hidden variables as well as the connectivity matrix that indicates the regulatory relationships of TFs with their regulated genes. Factor analysis (FA) as well as other methods, such as the network component algorithm, has been suggested for reconstructing gene regulatory networks and also for predicting TF activities. They have been applied to E. coli and yeast data with the assumption that these datasets consist of identical and independently distributed samples. Thus, the main drawback of these algorithms is that they ignore any time correlation existing within the TF profiles. In this paper, we extend previously studied FA algorithms to include time correlation within the transcription factors. At the same time, we consider connectivity matrices that are sparse in order to capture the existing sparsity present in gene regulatory networks. The TFs activity profiles obtained by this approach are significantly smoother than profiles from previous FA algorithms. The periodicities in profiles from yeast expression data become prominent in our reconstruction. Moreover, the strength of the correlation between time points is estimated and can be used to assess the suitability of the experimental time interval.
Genes are transcribed into mRNAs which in turn are translated into proteins. Some of these proteins activate or inhibit, as transcription factors (TFs), the transcription of a number of other genes creating a complex gene regulatory network. The number of transcription factors is believed to be much smaller than the number of regulated genes. Moreover, most genes are known to be regulated only by a very restricted number of transcription factors. This induces a sparse connectivity matrix for the representation of the connections between the TFs and the regulated genes. Microarray experiments measure the expression level of thousands of genes simultaneously. Unfortunately, a similar method that would allow us to measure simultaneously the abundance or activities of a larger number of proteins that act as TFs is not yet available. Some progress has been made with measurements of protein abundance by flow cytometry  following a dozen or so proteins of interest which need to be identified in advance. Still, such experiments are less available than gene expression experiments and cannot compete in terms of the number of tracked genes. ChIP-on-chip experiments, on the other hand, provide only static binding information about transcription factors. Thus, current approaches that use microarray experiments make a strong assumption: the protein levels are proportional to the mRNA levels. This assumption is not necessarily true due to the complexity of transcription, translation, and posttranslation modification. In more recent studies, two-level networks have been studied with hidden profiles of the transcription factors at the top level and the observed expression levels of the regulated genes at the lower level. Some of these studies [2–4] are concerned with factor analysis algorithms.
Factor analysis (FA) is often used as a dimensionality reduction approach assuming that the large number of observed variables becomes uncorrelated given a much smaller number of hidden variables called factors. Some of the advantages of FA over principle component analysis are the incorporation of independent additive measurement errors on the observed variables, the identification of an underlying structure, and the assignment of the factors as defined entities (in our case transcription factors). Finally, in contrast to independent component analysis, the factors are not assumed to be statistically independent.
In a recent paper, , we examined the suitability of five FA algorithms for reconstructing both gene regulatory networks and TF activity profiles. We showed that FA faithfully reconstructs TF activity profiles as other more widely known reconstruction approaches do such as network component analysis (NCA)  (see also ) and a piecewise least-square (plsgenomics) algorithm . The advantage of FA analysis over these algorithms is the ability to also reconstruct the connectivity matrix. NCA and plsgenomics rely heavily on the availability of connectivity information. Nonzero positions in the connectivity matrix, which describes the connections between the factors and the genes, need to be specified in advance. The algorithms then estimate the values at these positions which actually might turn out to be zero as well. This is a strong limitation since often only little information about genes regulated by specific TFs is available. A further advantage of the Bayesian FA models is that any information from literature surveys, ChIP-on-chip experiments, or sequence analyses about the underlying structure can be easily incorporated through priors.
A serious concern regarding the currently applied FA algorithms as well as NCA and plsgenomics algorithms is the lack of incorporation of any time information provided by the experiments. Actually, most available microarray data such as the E. coli, yeast, and Arabidopsis data are obtained from time series experiments. Unfortunately, the present time correlation within the TFs is ignored in the above algorithms. Time information can act as a smoothing approach on the TF profiles and thus can improve the reconstruction process. As in our previous paper, here we are still concerned with sparse connectivity matrices, but we also aim to include time correlation within the factors. For this purpose, we extend the algorithm by Fokoué and Titterington. , which performed well and was computationally efficient in our comparison , to handle time correlation information. However, the extensions we suggest can be easily applied to alternative FA algorithms analysed in . If we allowed a general form for the correlation matrix between the factors, we would run into the problem of estimating a large number of unknown parameters given only a small number of data points. We investigated a number of possible correlation structures and present one that performs well on gene regulatory networks in this paper.
Other algorithms such as the linear dynamic systems or Kalman filter models have also been suggested for estimation of the parameters of a time series model with hidden states. Ghahramani and Hinton  presented an EM algorithm for the estimation of the parameters of linear dynamical systems. This is an extension of the factor analysis algorithm  that was evaluated in our previous paper and performed less well than some alternative FA algorithms, in particular a Bayesian version. A Bayesian version of an FA algorithm allows one to use sparsity priors on the connectivity matrix and also to integrate prior information regarding the system under study. More recently, Beal et al.  presented a state space model for the reconstruction of transcriptional networks from gene expression time series data. The focus of their algorithm is to reconstruct a complete regulatory interaction network and not only the connectivity between TFs and genes. Thus, the hidden states do not represent TFs but any hidden variables that can not be directly measured by gene expression experiments such as missing genes, protein activity profiles, and protein degradation.
Barenco et al.  reconstruct the transcription factor activity of p53 from time series expression profiles of known target genes and a differential equation model of gene induction. Based on the same data set and a similar induction model, Sanguinetti et al.  suggest the usage of Gaussian processes to estimate the activity profile of the p53 transcription factor. In both cases, only one profile is reconstructed, albeit in great detail, and with an assumed knowledge of the dependent genes.
In this paper, we show how to incorporate time information in the factor analysis approach. Factor analysis is attractive, since it is on of the most straightforward ways to link hidden transcription factor activities to observed outputs without knowledge of the connectivity. However, time series information is ignored in all the methods discussed in our previous paper. Here, we explore an extension to factor analysis that integrates time series correlation. Since some data might show very little correlation or none at all, we estimate the posterior distribution of the strength of correlation of TF activities from one time point to the next. This information is useful in several respects as we show for gene expression data for E. coli from  and for yeast from Spellman et al. . Based on these datasets, we highlight some important points: (a) the correlation parameter within the factors reveals whether the time step during experimental sampling is large or small in relation to gene regulatory processes, and what the effect of this choice has on the reconstruction process; (b) our analysis also indicates that data obtained under different experimental conditions can show quite different dynamics as reflected in the correlation, and caution is required when combining such data sets for joint inference of regulatory relationships.
In this section, we describe the general factor analysis model and how time correlation information enters the model. We discuss how the model incorporates the sparsity of the connectivity matrix and discuss identifiability problems associated with factor analysis.
where , , , and with an dimensional row vector of ones. is called the loadings matrix or connectivity matrix with each of its entries indicating the relationship between a TF and a gene, for example, means that there is no interaction between gene and TF . For the rest of the paper, we assume , that is, data are centralised prior to the analysis.
where tr is the trace, the sum of the diagonal elements. In Section 5, we discuss in detail the prior and posterior probabilities of the parameters , and , as well as algorithms for their estimation.
where the matrix ( dimensions) is known as the between factors or spatial covariance matrix, which captures the correlation between the factors. We assign to the identity matrix to avoid scaling problems (see Section 2.3 below). The matrix ( dimensions) is known as the within factors or temporal covariance matrix. We assign 1 to the diagonal entries of this matrix so that it can be treated as a correlation matrix. The matrix captures the time correlation of each factor across the time points. We assign a first-order Markov structure to , thus an extra parameter that indicates the correlation between neighbouring time points must be estimated. The prior and posterior distributions of matrix as well as of the other parameters are discussed in Section 5.
Most genes are regulated by a small number of transcription factors. This implies that the connectivity matrix is sparse. To incorporate this sparsity in the factor analysis model, we assign independent Gaussian priors to each element of with mean zero and variance . A Gamma prior is assigned to each leading to a Student t-distribution for each row of . Such distribution assigns most probability mass to the origin and along the spines, where one of the coefficients is zero, that is, this prior favors connectivity structures in which genes are regulated by a small number of TFs.
There is an identifiability problem associated with (1). For a dimensional orthogonal matrix (i.e., ), we have with . Hence, it is not possible to distinguish between and all its possible orthogonal transformations based on knowledge of the product alone. However, with the objective of constructing a sparse connectivity matrix the number of possible orthogonal transformations is highly restricted.
Orthogonal transformations also include permutations of the factors which, in the case of regulatory networks, limit our ability to assign factors to known TFs. Sabatti and James.  constrain the factors to known TFs by assigning a priori loadings matrix with a large number of zeros based on the Vocabulon algorithm . Here, we aim to see if the FA algorithm is also able to reconstruct the activity profiles without any prior information. To assess how well profiles are reconstructed, we match inferred factors to TFs using profiles from the NCA algorithm, which requires prior knowledge of the regulatory relationships. For the matching of profiles, we use the Hungarian algorithm  described in Section 5.
In the FA algorithm, we utilise a Gibbs sampling algorithm in order to estimate the unknown parameters. In each iteration of the algorithm, it is possible that factors change sign or change position in the factors matrix. This is a widely known problem in FA. Here, we suggest the use of the Hungarian matching algorithm to identify factors that have changed sign or position in the factors matrix.
We analysed two-gene expression datasets, one from E. coli and one from yeast, both from time series experiments.
We have previously analysed the E. coli dataset using FA algorithms . The reconstructed profiles were rugged since the temporal information was ignored. Here, we reanalyse the data (consisting of 16 TFs, and 100 gene expression levels over 24 time points) using the new algorithm, and we compare the results with the previous ones in . We first approach the problem by supplying the connectivity matrix used in . This connectivity matrix was constructed based on RegulonDB and the available literature, and we will refer to it as the Kao connectivity matrix. Note that the NCA algorithm (or GNCA, a variation of it) requires such information in order to reconstruct the factor profiles and to estimate the nonzero values of the factor loadings matrix. Each algorithm was run 10 times. For the GNCA algorithm, we consider the run with the least mean-squared error, while for the FA algorithms we consider the average of these runs.
We also analysed the influence of the prior on the reconstruction process by providing prior loadings matrices that are denser than the Kao connectivity matrix, that is, we run the FA algorithm with decreasing numbers of given zero position in the prior loadings matrix. We tested priors where 75, 50, 25, or 5 entries per TF were fixed to zero. These positions were chosen randomly out of the approximately 75 zero positions that are present for each TF in the Kao connectivity matrix. We found that even with 50 zero entries out of the 75 entries that are present in Kao connectivity matrix the FA algorithms can reconstruct the same TF activity profiles equally well without having an identifiability problem regarding the labelling of the factors. However, as the number of constraints to zero in the prior loadings matrix is reduced further, assigning factors to TFs becomes more difficult, although the reconstruction of the TF profiles is still acceptable. This result is interesting from a biological perspective since it shows the importance of integrating prior information regarding the system under consideration even if the available information is limited. Negative experimental results showing that there is no connection between a TF and a gene are even more informative than positive experiments for this kind of analysis since they restrict the structure of the connectivity matrix very effectively.
The above correlation and reconstruction result for the subsample suggests that the experimental sampling frequency could have been reduced to about half the time points without too much loss of accuracy. Of course, if the correlation coefficient dropped much further that would indicate that a shorter time interval is required for an accurate reconstruction of the factor profiles.
We also apply the proposed algorithm to gene expression data from yeast , a dataset which has been used to evaluate the NCA algorithm by Kao et al. . They reduced the initial dataset of 2200 genes and 113 TFs to 441 and 33, respectively. This reduction in the size is required due to some network structural restrictions implied by the NCA algorithm. For comparison reasons, we also used this reduced dataset. Spellman et al.  used different culture synchronisation techniques: elutriation, -factor arrest, and cdc15 temperature sensitive mutant arrest. Thus, there are three datasets consisting of 14, 17, and, 24 time points, with time intervals of 30, 7, and 20 minutes, respectively.
The estimated yeast TF profiles are not as smooth as the E. coli profiles from the previous section. However, they are still smoother than the profiles estimated by the GNCA algorithm or the static FA algorithms. A possible explanation could be the—in comparison to E. coli—smaller correlation coefficients that are estimated for each time series: 0.68 for the elutriation, 0.56 for the factor arrest, and 0.45 for the cdc15 mutant arrest time series.
Figure 6 shows that different synchronisation methods result in different TF profiles and different estimations for the correlation coefficients particularly after accounting for differences in the time intervals. The comparatively large differences in the correlation coefficients hint at differences in the underlying dynamics. It seems, therefore, not advisable to combine datasets obtained under such different conditions for time series analyses based on such dynamics.
Finally, we measured the correlation of the constructed TF profiles with their corresponding expression profiles without finding any significant match. This result is consistent with that of Boulesteix and Strimmer. .
We presented a factor analysis algorithm that incorporates temporal correlation within each factor vector and also learns sparse factor loadings matrices. It is quite plausible that the underlying regulatory networks are quite sparse, and reconstruction algorithms based on this principle have received increasing interest recently. There is also increasing interest in time series experiments, since they are able to capture the dynamics of biological processes. The algorithm presented here utilises both sparsity and time correlation within a Bayesian framework. We estimated the distribution of time correlation coefficients for gene expression data from E. coli and yeast. It seems that a fair amount of correlation is present in the data, as might have been expected. Consequently, the ragged profiles resulting from a simple factor analysis without time constraints seem to be artifacts; it is simply unlikely that activity levels fluctuate so wildly as seen for some genes. The profiles reconstructed incorporating the estimated correlation certainly look smoother, and it is likely that they are closer to the true activity profiles than the ragged reconstructions by simple factor analysis methods. For the E. coli dataset, we demonstrated the effect of the time interval on time correlation and reconstructed profiles. Finally, we showed that the reconstruction process is reliable even with little or no information regarding the connectivity matrix. However, some form of prior knowledge is necessary to obtain smoother profiles or to match unknown factors to known TFs.
One drawback of the GNCA and FA algorithms is that they only model linear relationships between the TFs and the regulated gene. However, as shown in the PhD thesis by Pournara  the assumption of linearity is not a severe one given the small amount of data and the significant amounts of noise present in microarray data. Moreover, the linearity assumption is less of a problem after mRNA abundance levels or ratios are transformed logarithmically. As theoretical considerations (see ) show, gene regulation by several factors might be considered additive on the logarithmic scale, unless there is severe interaction between the factors. Some discussion of this issue is also provided in . Nevertheless, we are currently exploring approaches to nonlinear modelling using Gaussian processes and investigating ways to include time delays in the FA algorithms that can cause shifts to the profiles of the regulated genes. For the data used here, preliminary results using a Gaussian processes algorithm show that all factor to gene interactions can be effectively described by linear relationships.
where is the prior hyperparameter of the Wishart distribution. As mentioned above, in this paper, we fix to avoid some identifiability problems.
The treatment of the temporal correlation matrix is more complicated. If we assume that the temporal matrix is completely unknown, a conjugate prior distribution is the inverted Wishart distribution with prior hyperparameter .
This approach greatly increases the number of parameters to be estimated to . This large number of extra parameters causes problems (singular matrix) to the Gibbs sampler, and thus we have to considerably reduce this number of unknown parameters.
First-Order Markov Structure
If we assume that the correlation matrix has a known structure, we can significantly reduce the number of estimated parameters. We have examined a series of different structures included the single-parameter structure and concluded that the first-order Markov structure gives the best results, and thus we will only discuss this latter case.
where . For and , the beta distribution equals the uniform distribution. For large values of , approaches 1, while for large values of approaches −1.
where , , , and . Note that the prior distribution of is wrongly assigned in [, page 285]. Moreover, the posterior distribution is slightly different due to the difference in the treatment of the noise covariance matrix. Finally, the loadings matrix is also treated differently here (see below).
where , , , and is a row vector that corresponds to the th row of .
A Gibbs sampler algorithm (Algorithm 1) is used to estimate the unknown parameters. We first initialise the loadings and noise covariance matrices. Then, we iterate through steps 1 and 2 sampling from the posterior conditional distributions of the parameters. In the analysed datasets, convergence was reached within the first 100 samples. However, we discarded the first 3000 samples in order to ensure convergence and collect another 500 samples for the analysis.
Algorithm 1: Gibbs sampler algorithm for factor analysis.
Step 1. Sample from (6),
sample from (14) using a rejection sampling algorithm.
Step 2. Sample and from (15) and (17), respectively.
The sign of the factors can change during the Gibbs sampling algorithm. Moreover, two factors can alternate position in the factors matrix. These are two widely known problems with factor analysis. In order to overcome these problems, we suggest the use of the Hungarian algorithm which solves assignment problems in polynomial time. A cost matrix of dimensions is constructed given the inferred factor matrices at rounds and of the Gibbs sampler. The element of the cost matrix, is the mean-square error between the th and th profiles of the factor matrix at time and , respectively. This cost matrix is used to solve the assignment problem by minimising the sum of the costs. For the details of the algorithm, the reader is referred to the seminal paper by Kuhn . We also used this algorithm to assign inferred factor profiles to known TF profiles, that is, reconstructed by the GNCA algorithm.
The choice of prior hyperparameters is very important in a Bayesian framework. We choose uninformative priors for most of the parameters. The estimation of the parameter is fairly insensitive to the choice of the hyperparameters. Thus, we used a uniform prior distribution for the E. coli dataset and a prior of and for the yeast data. In the time series FA algorithm, the choice of the hyperparameter for the noise covariance matrix is important. While a very flat prior is a good choice in the static FA, in the time series FA analysis a prior distribution away from zero gives better results. However, we note that as long as the distribution is away from zero, the algorithm is robust with respect to the exact width of this distribution. In our experiments, the shape and scale hyperparameters were set to 10. These parameter values give a distribution away from zero and one that does not include unrealistically large values. For the parameter, we choose shape and scale parameters 10 and 0.1, respectively, in order to favour sparse connectivity matrices. To indicate that our reconstructed profiles are not a result of the choice of the hyperparameters, Figure 1 shows the reconstructed profiles for the static FA algorithm using exactly the same hyperparameters that are used in the time series FA algorithm.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.