Learning restricted Boolean network model by time-series data

Ouyang, Hongjia; Fang, Jie; Shen, Liangzhong; Dougherty, Edward R; Liu, Wenbin

doi:10.1186/s13637-014-0010-5

Research
Open access
Published: 15 July 2014

Learning restricted Boolean network model by time-series data

Hongjia Ouyang¹,
Jie Fang¹,
Liangzhong Shen¹,
Edward R Dougherty^2,3 &
…
Wenbin Liu^1,2

EURASIP Journal on Bioinformatics and Systems Biology volume 2014, Article number: 10 (2014) Cite this article

3276 Accesses
9 Citations
1 Altmetric
Metrics details

Abstract

Restricted Boolean networks are simplified Boolean networks that are required for either negative or positive regulations between genes. Higa et al. (BMC Proc 5:S5, 2011) proposed a three-rule algorithm to infer a restricted Boolean network from time-series data. However, the algorithm suffers from a major drawback, namely, it is very sensitive to noise. In this paper, we systematically analyze the regulatory relationships between genes based on the state switch of the target gene and propose an algorithm with which restricted Boolean networks may be inferred from time-series data. We compare the proposed algorithm with the three-rule algorithm and the best-fit algorithm based on both synthetic networks and a well-studied budding yeast cell cycle network. The performance of the algorithms is evaluated by three distance metrics: the normalized-edge Hamming distance $μ_{ham}^{e}$ , the normalized Hamming distance of state transition $μ_{ham}^{st}$ , and the steady-state distribution distance μ^ssd. Results show that the proposed algorithm outperforms the others according to both $μ_{ham}^{e}$ and $μ_{ham}^{st}$ , whereas its performance according to μ^ssd is intermediate between best-fit and the three-rule algorithms. Thus, our new algorithm is more appropriate for inferring interactions between genes from time-series data.

1 Introduction

A key goal in systems biology is to characterize the molecular mechanisms governing specific cellular behaviors and processes. This entails selecting a model class for representing the system structure and state dynamics, followed by the application of computational or statistical inference procedures to reveal the model structure from measurement data. The models of gene regulatory networks run the gamut from coarse-grained discrete networks to the detailed description of stochastic differential equations [[1]]. They provide a uniform way to study biological phenomena (e.g., cell cycle) and diseases (e.g., cancer) and ultimately lead to systems-based therapeutic strategies [[2]].

Boolean networks, and the more general class of probabilistic Boolean networks, are one of the most popular approaches for modeling gene networks. The inference of gene networks from high-throughput genomic data is an ill-posed problem. There exists more than one model that can explain the data. The search space for potential regulator sets and their corresponding Boolean functions generally increases exponentially with the number of genes in the network and the number of regulatory genes. It is particularly challenging in the face of small sample sizes, because the number of genes typically is much greater than the number of observations. Thus, estimates of modeling errors, which themselves are determined from the measurement data, can be highly variable and untrustworthy. Many inference algorithms have been proposed to elucidate the regulatory relationships between genes. Mutual information (MI) is an information-theoretic approach that can capture the nonlinear dependence between random variables. REVEAL is the first information-based algorithm to infer the regulatory relationships between genes [[3]]. However, a small MI does not necessarily mean that no regulatory relationship exists between genes (false negative). Conversely, a large MI does not necessarily mean a real regulatory relationship. ‘False-positive’ relationships often result from indirect interactions between two genes. The data processing inequality (DPI) and conditional mutual information (CMI) are two methods used to reduce the problem of false positives [[4],[5]]. Another information-based method is the minimum description length principle (MDL), which achieves a good trade-off between model complexity and fit to the data [[6]–[10]]. The coefficient of determination (CoD) selects a set of predictors whose expression levels can be used to better predict the expression of a target gene relative to the best possible prediction in the absence of observations [[11],[12]]. The best-fit extension incorporates inconsistencies generated from measurements or other unknown latent factors by constructing a network that makes as few misclassifications as possible [[13],[14]]. Any prior knowledge about the network structure or dynamics likely improves inference accuracy, especially for small sample sizes. Theoretical considerations and computational studies suggest that gene regulatory networks might operate close to a critical phase transition between ordered and disordered dynamical regimes [[15],[16]]. Liu et al. proposed a method to embed such a criticality assumption into the inference procedure. Such regularization of the sensitivity can both improve the inference and move the inferred networks closer to criticality [[17]].

A restricted Boolean network is a simplified Boolean model that has been used to study dynamical behavior of the yeast cell cycle [[18]–[24]]. In this model, the regulatory relationship between genes is either upregulation or downregulation. The output of the target gene is mainly dominated by the summation of its input genes. When the input summation is zero, the output state will remain as the current state of the target gene. The inference algorithm mentioned above generally cannot deal with this situation, and thus may not be appropriate to infer such network models. Recently, Higa et al. proposed a ‘three-rule algorithm’ to construct a restricted Boolean network from time-series data [[25]]. Their idea is that the consecutive state transitions of the system must be driven by some constraints, which can be induced from the small perturbations between two similar system states (detailed rules are provided in Section 3.1). However, the perturbations in microarry data sometimes may be caused by stochastic biological randomness or measurement process instead of real changes in gene expression level. This makes the three-rule algorithm inevitably lead to some incorrect constraints. In this paper, we propose a systematic method to infer a restricted Boolean network based on the state transitions of the target gene. Results of simulated networks and a modeled yeast cell cycle show that the proposed algorithm is more robust to noise than the three-rule method.

This paper is organized as follows: Background information and definitions are given in Section 2. Section 3 presents a brief introduction to the three rules; after which, we systematically analyze the regulatory relationships between input genes and their target gene and propose an inference algorithm. Section 4 and Section 5 present results for the simulated networks and for the cell cycle model of budding yeast. Concluding remarks are given in Section 6.

2 Background

2.1 Boolean networks

A Boolean network G(V, F) is defined by a set of nodes V = {x₁, …, x_n}, x_i ∈ {0, 1} and a set of Boolean functions F = {f₁, …, f_n} and $f_{i} : {\{0, 1\}}^{k_{i}} \to \{0, 1\}$ . Each node x_i represents the expression state of gene x_i, where x_i = 0 means that the gene is off, and x_i = 1 means it is on. Each node x_i is assigned a Boolean function $f_{i} (x_{1}, \dots, x_{k_{i}})$ with k_i specific input nodes, which is used to update its value. Under the synchronous updating scheme, all genes are updated simultaneously according to their corresponding update functions. The network's state at time t is represented by a binary vector x(t) = (x₁(t), …, x_n(t)). In the absence of noise, the state of the system at the next time step is

x (t + 1) = F (x_{1} (t), \dots, x_{n} (t))

(1)

The long-run behavior of a deterministic Boolean network (BN) depends on the initial state, and the network will eventually settle down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction (BOA) for the cycle. Following a perturbation, the network in the long run may randomly escape an attractor cycle, be reinitialized, and then begin its transition process anew. For a BN with perturbation probability p, its corresponding Markov chain possesses a steady-state distribution. It has been hypothesized that attractors or steady-state distributions in Boolean formalisms correspond to different cell types of an organism or to cell fates. In other words, the phenotypic traits are encoded in the attractors [[1]]. There are two ways to define the perturbation probability p. One is that each gene can flip its state according to an i.i.d random perturbation vector γ = (γ₁, ⋯, γ_n), where γ_i ∈ {0, 1}, the i th gene flips if and only γ_i = 1, and p = P(γ_i = 1) for i = 1, 2, ⋯, n. The other is each state x(t) can transit to any other state with the same probability p. In this situation, at each time step, state x(t) will transit to the next state according to F with probability 1 + p − 2ⁿ ∗ p and other states with probability p. In this paper, we adopt the later definition of the perturbation probability p.

2.2 Restricted Boolean networks

Restricted Boolean networks are simplified Boolean networks in which the regulatory relationships between genes obey the following convention: a_ij = 1 represents a positive regulation from gene x_j to x_i (activation); a_ij = − 1 represents a negative regulation from gene x_j to x_i (inhibition); and a_ij = 0 means that x_j has no effect on x_i. The Boolean function $f_{i} (x_{1}, \dots, x_{k_{i}})$ is defined as [[18]]

x_{i} (t + 1) = \{\begin{array}{c} 1, if \sum_{j \in \{1, \dots, k_{i}\}} a_{ij} x_{i} (t) > 0 \\ 0, if \sum_{j \in \{1, \dots, k_{i}\}} a_{ij} x_{i} (t) < 0 \\ x_{i} (t), if \sum_{j \in \{1, \dots, k_{i}\}} a_{ij} x_{i} (t) = 0 . \end{array}

(2)

This model is ‘restricted’ in the sense that functions satisfying formula (2) constitute a subset of the class of all Boolean functions. The number of restricted functions decreases dramatically as the input degree k_i increases. For example, there are 12 ( $< 2^{2^{2}} = 16$ ) restricted functions for k_i = 2, and only 60 functions ( $< < 2^{2^{3}} = 256$ ) for k_i = 3. The restricted model significantly reduces the model space, which is beneficial for inference, given a limited number of noisy high-throughput data.

3 Methods

3.1 Three-rule method

A time-series observation can be treated as a trajectory (or random walk) of the state space of the network used to model a real biological system. The three-rule method proposed by Higa et al. is to induce the constraints between genes from the small difference between two similar states and the difference between their next states [[25]]. Given an m-point time series S = {S(1), S(2), …, S(m)} of gene expression profiles, where S(t) ∈ {0, 1}ⁿ for t = 1, 2, …, m, the three rules are as follows:

Rule 1: Let S(t − 1), S(t), and S(t + 1) be three consecutive states. If S(t − 1) and S(t) differ by a single gene x_k, then for each gene x_i such that x_i(t) ≠ x_i(t + 1), we have x_k directly regulates x_i; that is, a_ik ≠ 0.

Rule 2: Only the active genes at time t can possibly regulate genes at time t + 1.

Rule 3: Given two similar states S(t₁) and S(t₂), the difference between S(t₁ + 1) and S(t₂ + 1) must result from the genes in their predecessors S(t₁) and S(t₂) that are expressed differently.

Both rules 1 and 3 can also be extended to situations where S(t − 1) and S(t) or S(t₁) and S(t₂) differ in more than one gene. Cyclically applying these rules to any two states may lead to a group of constraint inequalities between variables a_ij. Many available constraint satisfaction problem solvers (CSPs) [[26]] can be used to solve the possible regulatory relationships of one gene to the target gene.

Rules 1 and 3 may give incorrect relationships if applied to noisy data; in other words, they are very sensitive to the noise inherent in data. We demonstrate this by using a small network that contains only four genes (see Figure 1). An arrow represents positive regulation, a line segment with a bar at the end represents negative regulation, and the dotted loop on x₂ indicates that this gene downregulates itself. The time-series data at the right in Figure 1 are extracted from the network in Figure 1. Between S(1) and S(2), only x₂ changes from 1 to 0, and only x₃ flips from 0 to 1 in the successive states S(2) and S(3). We can conclude that x₂ must inhibit x₃ by applying rule 1, which means a₃₂ = − 1 because turning off x₂ turns on x₃. If S(2) becomes 1001 owing to noise, then we will also have that gene x₄ inhibiting x₂, which means a₂₄ = − 1.

3.2 Analysis of regulatory relationships based on constraints

In this section, we study the regulatory relationships based on the constraint inequalities in formula (2) and how the target gene switches from one state to another. The target gene can switch in one of four ways: 0 → 0, 0 → 1, 1 → 0, or 0 → 1. Given an input state, inactive genes have no effect on the target gene, which may help reduce the constraint inequalities of the summation ∑ _ja_ijx_j(t) (1 ≤ j ≤ k_i). Because the null input provides no constraints between a_ij, we only need to investigate the non-null input situations.

First, consider the simplest situation where there is only one regulatory gene $x_{j_{1}}$ . If gene $x_{j_{1}}$ is active and the target gene x_i switches from 0 to 1, then gene $x_{j_{1}}$ must activate the target gene x_i (which means $a_{i j_{1}} = 1$ ). On the contrary, if the target gene x_i switches from 1 to 0, then it must be inhibited by $x_{j_{1}}$ (which means $a_{i j_{1}} = - 1$ ). When the target gene x_i remains in state 1, we have $a_{i j_{1}} x_{j_{1}} \geq 0$ (which means $a_{i j_{1}} = 1$ ). When the target gene x_i remains in state 0, we have $a_{i j_{1}} x_{j_{1}} \leq 0$ (which means $a_{i j_{1}} = - 1$ ). We present the four possible regulatory relationships $a_{i j_{1}}$ in Table 1.

Table 1 Regulatory relationships for one input gene

Full size table

When there are two regulatory genes $x_{j_{1}}$ and $x_{j_{2}}$ , we only consider the input states 01, 10, and 11. If only one input gene is active, such as $x_{j_{1}} x_{j_{2}} = 01$ , then we can directly determine $a_{i j_{2}}$ from Table 1, whereas $a_{i j_{1}}$ remains totally non-determinant because it has no effect on the target gene. If both gene $x_{j_{1}}$ and gene $x_{j_{2}}$ are active, then we need to know whether or not the target gene x_i switches its state. First, if x_i switches from 1 to 0, then we have $a_{i j_{1}} = a_{i j_{2}} = - 1$ to satisfy the constraint $a_{i j_{1}} + a_{i j_{2}} < 0$ . Similarly, if x_i switches from 0 to 1, then we have $a_{i j_{1}} = a_{i j_{2}} = 1$ to satisfy the constraint $a_{i j_{1}} + a_{i j_{2}} > 0$ . Second, if x_i remains in state 0, then we have $a_{i j_{1}} = a_{i j_{2}} = - 1$ or $a_{i j_{1}} = - a_{i j_{2}}$ because $a_{i j_{1}} + a_{i j_{2}} \leq 0$ . Similarly, if x_i remains in state 1, then we have $a_{i j_{1}} = a_{i j_{2}} = 1$ or $a_{i j_{1}} = - a_{i j_{2}}$ because $a_{i j_{1}} + a_{i j_{2}} \leq 0$ . We call these later cases ‘semi-determined’ because there are two possible combinations of $a_{i j_{1}}$ and $a_{i j_{2}}$ in each case. In Table 2, we present the 12 possible regulatory relationships of $a_{i j_{1}}$ and $a_{i j_{2}}$ for two input genes.

Table 2 Regulatory relationships for two input genes

Full size table

Analogously, the regulatory relationships for three input genes are shown in Table 3. There are 10 semi-determined cases, and most of them occur when the target gene x_i does not change. Some of the semi-determined cases in Tables 2 and 3 may become determined if some a_ij are determined. For example, given $a_{i j_{1}} + a_{i j_{2}} \leq 0$ for (3) in Table 2, we can determine $a_{i j_{2}} = 1$ if $a_{i j_{1}}$ is determined to be 1. However, $a_{i j_{1}}$ still remains semi-determined (either 1 or −1) if $a_{i j_{1}}$ is determined to be −1. As the number of regulatory genes increases, the proportion of semi-determined cases increases significantly. We will not extend the above analysis to situations of more than three input genes. In most reference studies, the limit k_i ≤ 3 is generally respected to mitigate model complexity, particularly for small sample sizes.

Table 3 Regulatory relationships for three input genes

Full size table

Given a target gene x_i and its predictor genes x_j (1 ≤ j ≤ k_i), we may determine the value of a_ij at each time point t (1 ≤ t ≤ m − 1) by searching Tables 1, 2, or 3 across the whole time series S = {S(1), S(2), …, S(m)}. Let $N_{ij}^{- 1}$ , $N_{ij}^{1}$ , and $N_{ij}^{- 1, 1}$ denote the number of a_ij = − 1, a_ij = 1, and a_ij = − 1 or 1, respectively. The degree of determination of a regulatory relationship a_ij is defined as

d_{ij} = |N_{ij}^{- 1} - N_{ij}^{1}| .

(3)

If $N_{ij}^{- 1} > N_{ij}^{1}$ , then a_ij is likely to be −1; otherwise, it is likely to be 1. The larger the value of d_ij, the greater the determination of a_ij. In order to reduce the semi-determined cases, we first find the one with the largest determination, say, a_ij,, and determine its value by the majority rule. Then, we apply the value of a_ij to those inequalities including it to solve other semi-determined a_ip (p ≠ j, 1 ≤ p, j ≤ k_i). By repeating this process, we can reduce the number of semi-determined cases and determine the values of other a_ip accordingly.

3.3 Error analysis

Given a predictor set for gene x_i, the basic inconsistency is the discrepancy in the determination of a_ij, and we define the error resulting from such an inconsistency by $ε_{ij}^{- 1, 1} = min (N_{ij}^{- 1}, N_{ij}^{1})$ . A second kind of inconsistency arises from the null input. Specifically, the target gene x_i cannot flip its state under null input situations. Moreover, if it is negatively self-regulated (self-degradation), it cannot be active when its input genes are null. The number of such inconsistencies defines the error $ε_{i}^{null}$ , which is listed in Table 4 for self-degradation and no self-degradation, respectively. The total error of a predictor set is defined by $ε = ε_{i}^{null} + \sum_{j} ε_{ij}^{- 1, 1}$ . Generally, a consistent predicator set should have the minimal error and the minimal number of regulatory genes simultaneously.

Table 4 Errors in the null-input situations

Full size table

3.4 A small example

We now apply the above analysis to infer the predicator set for gene x₃ in Figure 1. Based on Tables 1,2,3,4, the results for all possible one- and two-input genes at each time point are presented in Tables 5,6,7,8, respectively. In those six possible predictor sets, the minimal error is achieved by x₁ and x₂, which are just the regulatory genes of x₃.

Table 5 Regulatory relationships a _{3
j} for one input x ₁ (or x ₂ or x ₄ ) at each time step

Full size table

Table 6 Regulatory relationships a _{3
j} for two inputs x ₁ and x ₂ at each time step

Full size table

Table 7 Regulatory relationships a _{3
j} for two inputs x ₁ and x ₄ at each time step

Full size table

Table 8 Regulatory relationships a _{3
j} for two inputs x ₂ and x ₄ at each time step

Full size table

3.5 Inference algorithm

Given a time series S = {S(1), S(2), …, S(m)}, the minimal error predictor sets may not be unique. Each of them can be viewed as fitting the target gene in a different way. We employ the heuristic that if one gene occurs frequently in those sets, then it is highly probably to be a true regulatory gene. Combining them may give a more reliable prediction and can also help alleviate the constraint of using at most three input genes for a target gene. Given a target gene x_i, we propose the following algorithm to infer its regulatory gene set:

1.
Calculate the total error of each combination of one, two, or three regulatory gene sets P(x _i).
2.
Sort the predictor sets in ascending order of their errors.
3.
If a gene appears in the first l sets with a frequency greater than or equal to 50%, then it is selected as a regulatory gene.

4 Implementation

As mentioned in the introduction, many algorithms have been proposed to infer gene regulatory networks. A recent study shows that the best-fit algorithm appears to give the best results for the recovery of regulatory relationships among REVEAL, BIC, MDL, uMDL, and Best-Fit [[27]]. In this paper, we compare the performance of the three-rule algorithm, the best-fit algorithm and the proposed algorithm based on both synthetic networks as well as on a well-studied budding yeast cell cycle network.

We have implemented the three-rule algorithm and our proposed algorithm based on the PBN Toolbox (http://code.google.com/p/pbn-matlab-toolbox/), which includes the implementation of best-fit algorithm and the calculation of the steady state distribution and other intervention modules for Boolean networks. Genetic regulatory networks are commonly believed to have sparse connectivity topology. To evaluate the inference algorithms based on simulated time series of network states, we have restricted the random BNs to resemble this property of biological networks. Specifically, we have generated random BNs with a scale-free topology, and each gene has at most five predictors: $= ma x_{i = 1}^{n} k_{i} \leq 5$ . We uniformly assign each gene 1 to K regulators that upregulate (1) or downregulate (−1) it. The average connectivity of random networks is (1 + K)/2.

In order to compare the performance of the three algorithms with the ground-truth network, we use the following three distances [[28],[29]]:

(1)
The normalized-edge Hamming distance,
$μ_{ham}^{e} = \frac{FN + FP}{P + N},$
(4)
where FN and FP represent the number of false-negative and false-positive wires, respectively. P and N represent the total number of positive and negative wires, respectively. This Hamming distance reflects the accuracy of the recovered regulatory relationships.
(2)
The normalized Hamming distance of state transitions,
$μ_{ham}^{st} = \frac{1}{n * 2^{n}} \sum_{i = 1}^{n} \sum_{k = 1}^{2^{n}} [f_{i} (x_{k}) \oplus f_{i}^{'} (x_{k})],$
(5)
where f _i(•) and $f_{i}^{'} (•)$ represent the Boolean function of gene i in the ground-truth network and the inferred network, respectively; x_k represents a binary state vector, and ⊕ denotes modulo-2 addition. This Hamming distance indicates the accuracy of the inferred network for predicting the next state of the ground-truth network.
(3)
The steady-state distribution distance,
$μ^{ssd} = \sum_{k = 1}^{2^{n}} |π_{k} - π_{k}^{'}|,$
(6)
where π _k and $π_{k}^{'}$ are the steady-state distribution of state x _k in the ground-truth network and the inferred network, respectively. The steady-state distribution distance reflects the degree of an inferred network approaching the long-run behavior of the ground-truth network.

5 Results and discussion

5.1 Simulated results

Owing to the computational complexity and the network state space, which increases exponentially with the number of genes or the network size, all our simulations are based on networks with n = 10 genes. We generate 300 random Boolean networks respectively with maximal input degree K = 3 and K = 5. For each simulated network, we generate about 4 time series so that the total time points add up to 40. Given a specific sample data, the noise is added by flipping the value of each bit with probability 0.05 and 0.10, respectively. The steady-state distribution is calculated by a perturbation parameter p = 0. 0001. For the proposed algorithm, we selected the first l = 10 minimal error predictor sets. For best fit, we selected the minimal error predictor sets from k = 1, 2, 3. In Table 9, we list the average number of true-positive and false-positive connections for K = 3 and K = 5 in different noise intensities.

Table 9 Average number of true-positive and false-positive connections for three algorithms

Full size table

Figure 2 shows the performance of three algorithms on networks with K = 3 under different noise intensities according to three distance metrics: the normalized-edge Hamming distance $μ_{ham}^{e}$ , the normalized Hamming distance of state transition $μ_{ham}^{st}$ , and the steady-state distribution distance μ^ssd. The performance of the three-rule algorithm and the proposed algorithm is very close when there is no noise. However, it differs dramatically in noisy data. Specifically, the performance of the proposed algorithm increases as the sample size increases while that of the three-rule algorithm decreases. The main reason lies in the fact that the proposed algorithm infers the regulatory relation based on the entire time series instead of on a small perturbation between two time points, which makes it more robust against noise than the three-rule algorithm. Given a specific noise intensity η, with more samples, there are more noisy perturbed bits; so, more incorrect connections will be inferred by the three-rule algorithm. Table 9 shows that the number of the false positives of the three-rule algorithm increases more quickly than that of the true positives as the sample size increases. This is the main factor which makes its performance deteriorate even though the sample size increases. Consequently, the three-rule algorithm is very sensitive to noise in the data, and increasing sample size makes no improvement in its performance.

Compared with the best-fit algorithm, the proposed algorithm performs better with respect to $μ_{ham}^{e}$ and $μ_{ham}^{st}$ . In a restricted Boolean network model, the output of states with $\sum_{j} a_{ij} x_{i} (t) = 0$ is determined by the current state of the target gene x_i. This means that given the same input state, x_i may be 1 at one time and be 0 at another time. The best-fit algorithm does not allow such situation, and it will treat such a case in the data as an error. If the target gene x_i has three regulators and one downregulates it, then there will be 3 such states out of the 8 possible input states. The influence of such cases on the performance of best-fit algorithm can not be neglected. Additionally, the best-fit algorithm cannot deal with the inconsistency listed in Figure 3. These two factors hurt its performances as compared to the proposed algorithm on $μ_{ham}^{e}$ and $μ_{ham}^{st}$ . Table 9 shows that the number of the true positives of both algorithms is almost the same, but the number of false positives of the best-fit algorithm is larger than that of the proposed algorithm.

Concerning the steady state distribution distance μ^ssd, the proposed algorithm performs not so well as the best-fit algorithm. However, their difference decreases as the noise intensity increases. As pointed in [[27]], the inferred networks with relative more connections can explain the observed data better with respect to steady-state distribution distance μ^ssd, even though some are incorrect connections. Because the best-fit algorithm infers more connection than the proposed algorithm (see Table 9), it performs better on μ^ssd than the latter. On the other hand, the proposed algorithm is more robust than the best-fit algorithm as it combines those minimal error sets to determine the regulatory gene instead of selecting one. When noise intensity increases, the performance of the best-fit algorithm will drop more quickly than that of the proposed algorithm, which leads to their performance on μ^ssd converges.

Figure 4 shows the performance of three algorithms on networks with K = 5, which are analogous to the trends observed in Figure 2. The only difference is that the performance of the three algorithms decreases because the networks' complexity makes them hard to infer. In summary, the proposed algorithm performs better than the three-rule algorithm on the three distance metrics in noisy situations, whereas it performs less well than the best-fit algorithm on the steady-state distribution distance. This suggests that it is more feasible to infer the structure of restricted Boolean network model than the three-rule algorithm and best-fit algorithm.

5.2 Cell cycle model of budding yeast

The cell cycle is a vital biological process in which one cell grows and divides into two daughter cells. It consists of four phases, G1, S, G2, and M, and is regulated by a highly complex network that is highly conserved among the eukaryotes. From the 800 genes involved in the cell cycle process of budding yeast, Li et al. constructed a network of 11 key regulators: Cln3, MBF, SBF, Cln1, Cdh1, Swi5, Cdc20, Clb5, Sic1, Clb1, and Mcm1 [[18]]. This restricted Boolean network model (shown in Figure 4A) has an attractor whose biggest basin corresponds to the biological G1 stationary state. The temporal sequence in Table 10 is a pathway from this basin, which follows the biological trajectory of the cell cycle network.

Table 10 Temporal evolution of state for cell cycle

Full size table

We have applied the three algorithms to the above artificial time-series data and show the inferred networks in Figure 4. In the simplified model of the budding yeast cell cycle, there are a total of 34 regulatory relationships (or connections). The three-rule algorithm inferred 10 relationships, all correct (see Figure 4B). The best-fit algorithm inferred 15 correct and 5 incorrect relationships (see Figure 4C). The proposed algorithm inferred 15 correct and 4 incorrect relationships (see Figure 4D). Both best-fit and the proposed algorithms inferred more true regulatory relationships than the three-rule algorithm with some incorrect connections. For studying regulatory relationships, this may be more advantageous because more potential regulatory relationships are made available for biologists to check in the wet lab.

We also ran 100 simulations with 5% and 10% noises for this pathway. Even for the same pathway data, the result of each noisy pathway data differs dramatically. This is not surprising because noise significantly influences the determination of regulatory relations for all algorithms. The performance of the three algorithms on $μ_{ham}^{e}$ , $μ_{ham}^{st}$ , and μ^ssd is listed Table 11. The relative performance of the three algorithms for this pathway data is also consistent with the previous simulation results.

Table 11 The performance of the three algorithms for the yeast-pathway data

Full size table

5.3 Computational issues

When inferring real networks with moderate size, the time complexity of algorithms is a key issue. Almost all algorithms proposed to date possess exponential complexity. The time complexity of the proposed algorithm and best-fit algorithm is $(n \cdot C_{n}^{k} \cdot m)$ . The most time-consuming process for the three-rule algorithm is to solve the constraint inequalities, and its time complexity is O(n ⋅ cⁿ ⋅ m²) (1 < c < 2). From this point of view, the three-rule algorithm is more time consuming than the other two.

The proposed algorithm is similar in workflow to the best-fit algorithm; however, additional computation time results from three factors: (1) determination of the possible regulatory relationships, (2) determination during error estimation if an output state is correct for a given model according to Equation (2), and (3) combination of the first ten least-error models in the last step.

In practice, however, algorithm complexity is not the limiting factor. As shown in Table 12, for 11, 12, and 13 genes, and for N = 20 and N = 40, the proposed algorithm's computation time is between the best-fit and the three-rule algorithms, but the overriding computational issue is computation of the steady-state distribution, which is often required for application. It is for this reason that interest has focused on reducing network complexity [[29]–[31]].

Table 12 Algorithm timings (seconds)

Full size table

6 Conclusion

The model space of Boolean networks is huge and from the point of view of evolution, it is unimaginable for nature to select its operational mechanisms from such a large space. Restricted Boolean networks, as a simplified model, have recently been extensively used to study the dynamical behavior of the yeast cell cycle process. In this paper, we propose a systematic method to infer the restricted Boolean network from time-series data. We compare the performance of the three-rule, best-fit, and the proposed algorithms both on simulated networks and on an artificial model of budding yeast. Results show that our algorithm performs better than the three-rule and best-fit algorithms according to the distance metrics $μ_{ham}^{e}$ and $μ_{ham}^{st}$ , but slightly less well than the best-fit algorithm according to μ^ssd. This result indicates that the proposed algorithm may be more appropriate for recovering regulatory relationships between genes under the restricted Boolean network model.

The main advantage of the proposed algorithm is that it is more robust to noise than both the three-rule algorithm and best-fit algorithm. The proposed algorithm infers the regulatory relationships according to the consecutive state transitions of the target gene, instead of the small perturbations between two similar states in the three-rule algorithm. Simulation results show that noise in the data may induce many incorrect constraints by the three-rule algorithm. This hinders its application to noisy samples. Moreover, the proposed algorithm can capture the intrinsic state transition defined in Equation 2, whereas the best-fit algorithm cannot. Hence, because the inference processes of both algorithms try to find the minimal-error predictor set, the proposed algorithm can distinguish error in the data more accurately than the best-fit algorithm. Additionally, combination of the minimal error predictor sets in the proposed algorithm also improves its robustness.

In the Boolean formalism, a single time series (or trajectory) can be treated as a random walk across state space. It is not possible to recover the complex biological system from just one short trajectory by any method. Using heterogeneous data and some a priori knowledge is typically a necessity. A priori knowledge can be incorporated into the proposed algorithm and helps by reducing the search space. For instance, an algorithm might assume a prescribed attractor structure [[32]]. In our case, if we know that x regulates y, then we only consider those combinations containing x, thereby reducing the search space. Additionally, different methods may focus on different aspects of the inference process. For example, the best-fit algorithm and CoD are mainly concerned with the fitness of the data, whereas MDL-based methods intend to reduce structural risks. Future work will involve combining MDL with the proposed algorithm to reduce the rate of false positives.

References

Ilya S, Dougherty ER: Genomic Signal Processing (Princeton Series in Applied Mathematics). Princeton University Press, Princeton; 2007.
Google Scholar
Ilya S, Dougherty ER: Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. Siam, Philadelphia; 2010.
Google Scholar
Shoudan L, Stefanie F, Roland S: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures, in Pacific Symposium on Biocomputing. World Scientific, Hawaii; 1998.
Google Scholar
Margolin AA, Ilya N, Katia B, Chris W, Gustavo S, Riccardo DF, Andrea C: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinforma 2006, 7: S7.
Article Google Scholar
Zhao W, Serpedin E, Dougherty ER: Recovering genetic regulatory networks from chromatin immunoprecipitation and steady-state microarray data. EURASIP J. Bioinforma. Syst. Biol. 2008.
Google Scholar
Vijender C, Preetam G, Edward P, Gong GP, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst. Biol. 2010, 4: S7.
Google Scholar
Vijender C, Chaoyang Z, Preetam G, Perkins EJ, Gong P, Deng Y: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS). Edited by: Zhang J, Li G, Yang JY. IEEE Computer Society, Piscataway; 2009:487-490.
Google Scholar
Dougherty J, Tabus I, Astola J: A universal minimum description length-based algorithm for inferring the structure of genetic networks. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). Edited by: Huang Y. IEEE, Piscataway; 2007:1-2.
Google Scholar
Tabus I, Astola J: On the use of MDL principle in gene expression prediction. EURASIP J Appl Signal Process 2001, 2001: 297-303.
Article MathSciNet Google Scholar
Zhao W, Erchin S, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22: 2129-2135.
Article Google Scholar
Dougherty RE, Seungchan K, Yidong C: Coefficient of determination in nonlinear signal processing. Signal Process. 2000, 80: 2219-2235.
Article Google Scholar
Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J. Biomed. Opt. 2000, 5: 411-424.
Article Google Scholar
Shmulevich I, Dougherty ER, Seungchan K, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18: 261-274.
Article Google Scholar
Lähdesmäki H, Shmulevich I, Yli-Harja O: Learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52: 147-167.
Article Google Scholar
Shmulevich I, Kauffman SA, Maximino A: Eukaryotic cells are dynamically ordered or critical but not chaotic. Proc. Natl. Acad. Sci. U. S. A. 2005, 102: 13439-13444.
Article Google Scholar
Nykter M, Price ND, Maximino A, et al.: Gene expression dynamics in the macrophage exhibit criticality. Proc. Natl. Acad. Sci. 2008, 105: 1897-1900.
Article Google Scholar
W Liu, H Lähdesmäki, ER Dougherty, I Shmulevich, Inference of Boolean networks using sensitivity regularization. EURASIP J. Bioinforma. Syst. Biol. (2008). doi:10.1155/2008/780541
Google Scholar
Li F, Long T, Ying L, Ouyang Q, Tang C: The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. U. S. A. 2004, 101: 4781-4786.
Article Google Scholar
Zhang Y, Qian M, Ouyang Q, Deng M, Li F, Tang C: Stochastic model of yeast cell-cycle network. Physica D: Nonlinear Phenomena 2006, 219: 35-39.
Article MathSciNet Google Scholar
Kai-Yeung L, Surya G, Chao T: Function constrains network architecture and dynamics: a case study on the yeast cell cycle Boolean network. Phys. Rev. E. 2007, 75: 051907.
Article Google Scholar
Bornholdt S: Boolean network models of cellular regulation: prospects and limitations. J. R. Soc. Interface 2008, 5: S85-S94.
Article Google Scholar
Davidich MI, Stefan B: Boolean network model predicts cell cycle sequence of fission yeast. PLoS One 2008, 3: e1672.
Article Google Scholar
Ronaldo Fumio H, Henrique S, Carlos HA H: Budding yeast cell cycle modeled by context-sensitive probabilistic Boolean network. In IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). Edited by: Braga-Neto U. IEEE, Piscataway; 2009:1-4.
Google Scholar
Todd RG, Tomáš H: Ergodic sets as cell phenotype of budding yeast cell cycle. PLoS One 2012, 7: e45780.
Article Google Scholar
Higa CHA, Louzada VHP, Andrade TP, Hashimoto RF: Constraint-based analysis of gene interactions using restricted Boolean networks and time-series data. BMC Proc. 2011,5(Suppl 2):S5.
Article Google Scholar
Niklas E, Niklas S: An extensible SAT-solver. In Theory and Applications of Satisfiability Testing. Edited by: Giunchiglia E, Tacchella A. Springer, New York; 2004:502-518.
Google Scholar
Dougherty ER: Validation of gene regulatory networks: scientific and inferential. Brief. Bioinform. 2011, 12: 245-252.
Article Google Scholar
Xiaoning Q, Dougherty ER: Validation of gene regulatory network inference based on controllability. Front. Genet. 2013, 4: 272.
Google Scholar
Ghaffari N, Ivanov I, Qian X, Dougherty ER: A CoD-based reduction algorithm for designing stationary control policies on Boolean networks. Bioinformatics 2010, 26: 1556-1563.
Article Google Scholar
Ivanov I, Simeonov P, Ghaffari N, Xiaoning Q, Dougherty ER: Selection policy-induced reduction mappings for Boolean networks. Signal Process. IEEE Trans. 2010, 58: 4871-4882.
Article Google Scholar
Qian X, Ghaffari N, Ivanov I, Dougherty ER: State reduction for network intervention in probabilistic Boolean networks. Bioinformatics 2010, 26: 3098-3104.
Article Google Scholar
Pal R, Ivanov I, Datta A, Bittner ML, Dougherty ER: Generating Boolean networks with a prescribed attractor structure. Bioinformatics 2005, 21: 4021-4025.
Article Google Scholar

Download references

Acknowledgements

This work was funded in part by the National Science Foundation of China (Grant No. 61272018, No. 60970065, and No. 61174162) and the Zhejiang Provincial Natural Science Foundation of China (Grant No. R1110261 and No. LY13F010007) and support from China Scholarship Council.

Author information

Authors and Affiliations

Department of Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, 325035, Zhejiang, China
Hongjia Ouyang, Jie Fang, Liangzhong Shen & Wenbin Liu
Department of Electrical and Computer Engineering, Texas A&M University, College Station, 33101, TX, USA
Edward R Dougherty & Wenbin Liu
Computational Biology Division, Translational Genomics Research Institute, Phoenix, 77843, AZ, USA
Edward R Dougherty

Authors

Hongjia Ouyang
View author publications
You can also search for this author in PubMed Google Scholar
Jie Fang
View author publications
You can also search for this author in PubMed Google Scholar
Liangzhong Shen
View author publications
You can also search for this author in PubMed Google Scholar
Edward R Dougherty
View author publications
You can also search for this author in PubMed Google Scholar
Wenbin Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wenbin Liu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Ouyang, H., Fang, J., Shen, L. et al. Learning restricted Boolean network model by time-series data. J Bioinform Sys Biology 2014, 10 (2014). https://doi.org/10.1186/s13637-014-0010-5

Download citation

Received: 15 December 2013
Accepted: 12 May 2014
Published: 15 July 2014
DOI: https://doi.org/10.1186/s13637-014-0010-5

Learning restricted Boolean network model by time-series data

Abstract

1 Introduction

2 Background

2.1 Boolean networks

2.2 Restricted Boolean networks

3 Methods

3.1 Three-rule method

3.2 Analysis of regulatory relationships based on constraints

3.3 Error analysis

3.4 A small example

3.5 Inference algorithm

4 Implementation

5 Results and discussion

5.1 Simulated results

5.2 Cell cycle model of budding yeast

5.3 Computational issues

6 Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

About this article

Cite this article

Share this article

Keywords