Using the minimum description length principle to reduce the rate of false positives of best-fit algorithms

Fang, Jie; Ouyang, Hongjia; Shen, Liangzhong; Dougherty, Edward R; Liu, Wenbin

doi:10.1186/s13637-014-0013-2

Research
Open access
Published: 03 July 2014

EURASIP Journal on Bioinformatics and Systems Biology volume 2014, Article number: 13 (2014)

Abstract

The inference of gene regulatory networks is a core problem in systems biology. Many inference algorithms have been proposed and all suffer from false positives. In this paper, we use the minimum description length (MDL) principle to reduce the rate of false positives for best-fit algorithms. The performance of these algorithms is evaluated via two metrics: the normalized-edge Hamming distance and the steady-state distribution distance. Results for synthetic networks and a well-studied budding-yeast cell cycle network show that MDL-based filtering is more effective than filtering based on conditional mutual information (CMI). In addition, MDL-based filtering provides better inference than the MDL algorithm itself.

1 Introduction

A key goal in systems biology is to characterize the molecular mechanisms that govern specific cellular behavior and processes. Models of gene regulatory networks run the gamut from coarse-grained discrete networks to detailed descriptions of such networks by stochastic differential equations [1]. Boolean networks and the more general class of probabilistic Boolean networks are among the most popular approaches for modeling gene networks because they provide a structured way to study biological phenomena (e.g., the cell cycle) and diseases (e.g., cancer), ultimately leading to systems-based therapeutic strategies. The inference of gene networks from high-throughput genomic data is an ill-posed problem known as reverse engineering. It is particularly challenging when dealing with small sample sizes because the number of variables in the system (e.g., the number of genes) typically is much greater than the number of observations [2]. Many inference algorithms have been proposed to elucidate the regulatory relationships between genes, such as Reveal [3], ARACNE [4], the minimum description length principle (MDL) [5]–[9], the coefficient of determination (CoD) [10],[11], and the best-fit extension [12],[13].

False positives are a common problem in inference, especially when dealing with small sample sizes and noisy conditions. In fact, false positives are a kind of structural redundancy. Three genes, x₁, x₂, and x₃, may interact in a chain-like manner, such as x₁ → x₂ → x₃ or x₁ ← x₂ ← x₃, or in a hub-based way, such as x₁ → x₂ ← x₃ or x₁ ← x₂ → x₃. Indirect interactions between two genes may produce some correlation in their expression data, which can lead inference algorithms to detect a regulation that does not exist. The data-processing inequality (DPI) was first used in ARACNE to reduce the false positives produced by chain-like interactions [4]. Later, conditional mutual information (CMI) was proposed to tackle the false positives produced by both chain-like and hub-based interactions [14]. Because the conditioning gene, x₂, is usually not known, a greedy search strategy was adopted to check whether the CMI between x₁ and x₃ conditioned on some other gene was below a given threshold. Checking the CMI against other, unrelated genes is problematic: not only is it computationally burdensome, it also suffers from an enormous multiple-comparisons problem. Moreover, since interaction strength varies widely between genes, there being both strong and weak interactions, setting an appropriate threshold is a key problem.

A recent study shows that the best-fit algorithm appears to give the best results for recovering regulatory relationships in comparison to the aforementioned algorithms [15]. In the present paper, we propose to reduce the false positives of the best-fit algorithm by using the MDL principle. Simulation results show that it is more effective than the CMI-based method and can reduce the false positives in the MDL algorithm in [5]. In effect, the false-positive reducing procedure acts as a filter for removing false positives.

The aim of filtering in the present framework is to reduce the number of false positive connections. As with any false-positive reducing algorithm, this will invariably increase the number of false negatives, meaning more missing connections. Thus, two questions must be addressed. First, what benefits accrue from reducing the number of false positives? Second, does the increase in false negatives significantly impact inference performance?

A salient problem in translational genomics is the utilization of gene regulatory networks in determining therapeutic intervention strategies [2],[16],[17]. A major obstacle in deriving optimal treatment strategies from networks is the computational complexity arising directly from network complexity. Hence, significant effort has been focused on network reduction [18],[19]. As with any compression scheme, reduction methods sacrifice information in return for computational tractability. Because genes are removed from the network based upon their regulatory relations with other genes, false positives are particularly troublesome: first, they increase the amount of reduction necessary; second, they compete with true-positive connections for retention in the reduced network. While it is true that an increase in false negatives is not beneficial, a missing connection creates no additional computational burden (in fact, it reduces computation) and plays no role in the reduction procedure.

Now for the caveat: all of this is fine so long as the accuracy of the original inference algorithm is not adversely impacted. Practically, this means that, relative to some distance function between a ground-truth network and an inferred network (which quantifies inference accuracy), the distance is not increased when the modified false-positive reducing algorithm is used in place of the original algorithm. In this paper, we consider two distance functions, one based on the Hamming distance between the ground-truth and inferred networks and the other based on the difference between the steady-state distributions of the ground-truth and inferred networks.

This paper is organized as follows: Background information and necessary definitions are given in Section 2. The implementation of MDL, the best-fit algorithm, and CMI- and MDL-based filtering is then introduced in Section 3. Results from simulated networks and from the cell cycle model of budding yeast are presented in Section 4. Finally, concluding remarks are given in Section 5.

2 Background

2.1 Boolean networks

A Boolean network G(V, F) is defined by a set of nodes V = {x₁, …, x_n}, x_i ∈ {0, 1}, and a set of Boolean functions F = {f₁, …, f_n}, $f_{i} : {\{0, 1\}}^{k_{i}} \to \{0, 1\}$. Each node x_i represents the expression state of a gene, where x_i = 0 means that the gene is off and x_i = 1 means that it is on. To update its value, each node x_i is assigned a Boolean function $f_{i} (x_{i 1}, \dots, x_{i k_{i}})$ with k_i specific input nodes. Under the synchronous updating scheme, all genes are updated simultaneously according to their corresponding update functions. The network's state at time t is represented by a binary vector x(t) = (x₁(t), …, x_n(t)). In the absence of noise, the state of the system at the next time step is

x (t + 1) = F (x_{1} (t), \dots, x_{n} (t)) .

(1)

The long-term behavior of a deterministic Boolean network depends on the initial state. The network will eventually settle down and cycle endlessly through a set of states called an attractor cycle. The set of all initial states that reach a particular attractor cycle forms the basin of attraction for the cycle. Following a random perturbation, the network may escape an attractor cycle, be reinitialized, and then begin its transition process anew. For a Boolean network with perturbation, its corresponding Markov chain possesses a steady-state distribution. It has been hypothesized that attractors or steady-state distributions in Boolean formalisms correspond to different cell types of an organism or to cell fates. In other words, the phenotypic traits are encoded in the attractors or steady-state distribution [1].
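The synchronous update of Equation 1 and the notions of attractor cycle and basin of attraction can be illustrated with a minimal sketch. The three-gene network `NETWORK` and the function names `step`, `attractor`, and `basins` are hypothetical and chosen only for illustration; they do not come from the networks studied in this paper.

```python
from itertools import product

# Hypothetical 3-gene network: each entry is (input indices, Boolean function).
NETWORK = [
    ((1,), lambda b: b),             # x1(t+1) = x2(t)
    ((0, 2), lambda a, c: a and c),  # x2(t+1) = x1(t) AND x3(t)
    ((0,), lambda a: 1 - a),         # x3(t+1) = NOT x1(t)
]

def step(state, network=NETWORK):
    """Synchronous update: every gene reads x(t), then all change at once."""
    return tuple(int(f(*(state[j] for j in inputs))) for inputs, f in network)

def attractor(state, network=NETWORK):
    """Iterate from `state` until a state repeats; return the attractor cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, network)
    return seen[seen.index(state):]

def basins(network=NETWORK):
    """Group all 2^n initial states by the attractor cycle they reach."""
    result = {}
    for s in product((0, 1), repeat=len(network)):
        key = min(attractor(s, network))  # canonical label for the cycle
        result.setdefault(key, []).append(s)
    return result
```

In this toy network, the state (1, 1, 1) settles into a single fixed point, while (0, 1, 1) falls into a cycle of length two, so the state space splits into two basins of attraction.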

2.2 Best-fit extension

One approach to inferring Boolean networks is to search for a rule that is consistent with the examples, the so-called consistency problem [20]. Owing to noise in gene-expression profiles, we relax it to the so-called best-fit extension problem, which has been extensively studied for many function classes [21]. We briefly introduce the best-fit extension problem for Boolean functions. A partially defined Boolean function (pdBf) is defined by two sets, T, F ⊆ {0, 1}ⁿ, where T and F represent the sets of true and false vectors, respectively. A function f is called an extension of pdBf(T, F) if T ⊆ T(f) = {x ∈ {0, 1}ⁿ : f(x) = 1} and F ⊆ F(f) = {x ∈ {0, 1}ⁿ : f(x) = 0}. The magnitude of the error of function f is the number of vectors it misclassifies,

ε (f) = |T \cap F (f)| + |F \cap T (f)| .

(2)

The best-fit extension problem aims to find two subsets T* and F* such that T* ∩ F* = ∅ and T* ∪ F* = T ∪ F, for which pdBf(T*, F*) has an extension in some class C of Boolean functions and |T* ∩ F| + |F* ∩ T| is minimized. Clearly, any extension f ∈ C of pdBf(T*, F*) has minimum error magnitude [12],[13].
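For the unrestricted class of Boolean functions on a fixed set of inputs, the minimum error can be computed pattern by pattern: each observed input pattern is assigned whichever output it shows more often, and the minority observations are the errors. The sketch below assumes that repeated observations are counted with multiplicity (a unit weight per example), which is one possible reading of the error measure; the function name `best_fit_error` is hypothetical.

```python
from collections import Counter

def best_fit_error(inputs, outputs):
    """Return (minimum error, truth table) over all Boolean functions of `inputs`.

    inputs:  sequence of input vectors (tuples of 0/1)
    outputs: sequence of observed 0/1 outputs, aligned with `inputs`
    """
    ones, zeros = Counter(), Counter()
    for x, y in zip(inputs, outputs):
        (ones if y else zeros)[tuple(x)] += 1
    table, error = {}, 0
    for x in set(ones) | set(zeros):
        # Majority output minimizes the misclassifications for this pattern.
        table[x] = int(ones[x] > zeros[x])
        error += min(ones[x], zeros[x])
    return error, table
```

For example, if the pattern (0, 1) is observed twice with output 1 and once with output 0, the best-fit function maps it to 1 and that pattern contributes one error.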

2.3 Conditional mutual information

Mutual information (MI) is a general measurement that can detect nonlinear dependence between two random variables X and Y. For discrete-valued random variables, the one-time-lag MI from X_t to Y_{t + 1} is given by

I (Y_{t + 1}; X_{t}) = H (Y_{t + 1}) - H (Y_{t + 1} | X_{t})

(3)

where H(•) denotes entropy and X_t and Y_{t + 1} are two equal-length vectors. The conditional mutual information (CMI) from X_t to Y_{t + 1} given Z_t is

I (Y_{t + 1}; X_{t} | Z_{t}) = H (Y_{t + 1} | Z_{t}) - H (Y_{t + 1} | X_{t}, Z_{t}),

(4)

and quantifies the reduction in the uncertainty of Y_{t + 1} due to knowledge of X_t given Z_t. In the chain-like or hub-based scenarios, genes X_t and Y_{t + 1} should be independent given the intermediate or hub gene Z_t, which means that I(Y_{t + 1}; X_t | Z_t) = 0.
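Equations 3 and 4 can be estimated from discrete time series by replacing each entropy with its empirical (plug-in) estimate. The following is a minimal sketch under that assumption; the helper names `entropy`, `mutual_information`, `conditional_mutual_information`, and `lagged` are hypothetical, and entropies are measured in bits.

```python
from collections import Counter
from math import log2

def entropy(*cols):
    """Empirical joint entropy (bits) of one or more equal-length sequences."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())

def mutual_information(y_next, x):
    # I(Y;X) = H(Y) - H(Y|X) = H(Y) + H(X) - H(X,Y)
    return entropy(y_next) + entropy(x) - entropy(x, y_next)

def conditional_mutual_information(y_next, x, z):
    # I(Y;X|Z) = H(Y|Z) - H(Y|X,Z) = H(Y,Z) - H(Z) - H(X,Y,Z) + H(X,Z)
    return (entropy(y_next, z) - entropy(z)
            - entropy(x, y_next, z) + entropy(x, z))

def lagged(series_x, series_y):
    """Align X_t with Y_{t+1} for the one-time-lag estimates."""
    return series_x[:-1], series_y[1:]
```

When Y is fully determined by Z, conditioning on Z drives the estimate to zero even though the unconditioned MI between X and Y may be large, which is the situation CMI-based filtering tries to detect.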

2.4 Minimum description length principle

A fundamental principle in model selection is the minimum description length (MDL) principle, which states that we should choose the model that gives the shortest description of the data. The ‘two-part MDL’ developed by Rissanen consists of writing the description length of a given model applied to a data set as the sum of the code length for describing the model and the code length for describing the data set fit by the model [22]

L = L_{M} + L_{D} .

(5)

There are various ways to encode the model-coding length L_M and the data-coding length L_D. Given a time series of length m, Zhao et al. proposed to encode L_M and L_D as [5]

L_{M} = τ \sum_{i = 1}^{n} \{d_{i} * k_{i} + d_{f} * 2^{k_{i}}\},

(6)

L_{D} = - \sum_{i = 1}^{n} \sum_{t = 1}^{m - 1} log p (x_{i} (t + 1) | x_{i 1} (t) \dots x_{i k_{i}} (t)),

(7)

where τ is a free parameter that balances the model- and data-coding lengths, and n and m are the numbers of genes and time points, respectively. The quantities d_i = ⌈log₂n⌉ and d_f = ⌈log₂m⌉ denote the number of bits needed to code an integer and a floating-point number, respectively.
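The contribution of a single gene to Equations 6 and 7 can be sketched as follows. Two assumptions are made that the text does not spell out: the conditional probability p is taken to be the empirical conditional frequency in the time series, and logarithms are base 2 so that both terms are in bits. The function name `coding_length` is hypothetical.

```python
from collections import Counter
from math import ceil, log2

def coding_length(series, target, parents, tau=0.2):
    """Per-gene description length L_i = L_M + L_D for `target` given `parents`.

    series: list of network states, one tuple of 0/1 values per time point.
    """
    n, m = len(series[0]), len(series)
    d_i, d_f = ceil(log2(n)), ceil(log2(m))
    k = len(parents)
    model_len = tau * (d_i * k + d_f * 2 ** k)   # this gene's term of Eq. 6

    # (parent values at t, target value at t+1) for t = 1, ..., m-1
    pairs = [(tuple(s[j] for j in parents), nxt[target])
             for s, nxt in zip(series, series[1:])]
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    # Assumption: p(.|.) is the empirical conditional frequency, logs base 2.
    data_len = -sum(c * log2(c / marginal[x]) for (x, _), c in joint.items())
    return model_len + data_len
```

A regulator set that predicts the target perfectly gives a data-coding length of zero, so the total is then driven by the model term, which grows exponentially with the number of parents.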

3 Implementation

Based on the common assumption that genetic regulatory networks are sparsely connected, we restrict simulated Boolean networks to a scale-free topology with maximal connectivity K = 4 and average connectivity k = 2. The best-fit algorithm finds the best-fit function for each gene by exhaustively searching all combinations of potential regulators. The search space grows exponentially with the number of genes, so in practice the limit k_i ≤ 3 is generally applied to mitigate model complexity. In this paper, we restrict best-fit searches to combinations of 1, 2, or 3 possible regulators. The combination with the smallest error is then selected as the regulatory set. We call this best-fit-I. In practice, the minimal-error predictor set may not be unique. We employ the heuristic that each such set can be viewed as fitting the target gene in a different way, and that a gene occurring frequently in those sets is highly likely to be a true regulatory gene. Thus, we can determine the regulatory set by applying the majority rule to these sets. We refer to this algorithm as best-fit-II.
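The exhaustive search and the majority rule described above might be sketched as follows. Two details are assumptions rather than statements from the text: best-fit-I breaks ties by taking the first minimal-error set encountered, and in best-fit-II a gene passes the majority rule if it appears in more than half of the tied sets. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def fit_error(series, target, regs):
    """Best-fit error of `target` at t+1 predicted from `regs` at t."""
    ones, zeros = Counter(), Counter()
    for s, nxt in zip(series, series[1:]):
        key = tuple(s[j] for j in regs)
        (ones if nxt[target] else zeros)[key] += 1
    return sum(min(ones[k], zeros[k]) for k in set(ones) | set(zeros))

def minimal_error_sets(series, target, max_k=3):
    """All regulator sets of size 1..max_k that attain the smallest error."""
    n = len(series[0])
    scored = [(fit_error(series, target, regs), regs)
              for k in range(1, max_k + 1)
              for regs in combinations(range(n), k)]
    best = min(e for e, _ in scored)
    return [regs for e, regs in scored if e == best]

def best_fit_I(series, target, max_k=3):
    # Assumption: ties are broken in favour of the first set found.
    return minimal_error_sets(series, target, max_k)[0]

def best_fit_II(series, target, max_k=3):
    # Assumption: "majority" means present in more than half of the tied sets.
    sets = minimal_error_sets(series, target, max_k)
    votes = Counter(g for regs in sets for g in regs)
    return tuple(sorted(g for g, v in votes.items() if 2 * v > len(sets)))
```

Because any superset of a zero-error set also has zero error, ties are common on small samples, which is where the voting step can discard regulators that appear only incidentally.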

CMI and MDL criteria are then used to filter false-positive connections. For CMI-based filtering, a regulatory connection is deleted if its CMI conditioned on any one of the remaining genes is less than 0.005; otherwise, it is retained. The MDL criterion is applied to each target gene x_i. Given its parent set, Pa(x_i), at each step we delete the regulatory gene x_j ∈ Pa(x_i) whose removal maximally reduces the coding length L_i, repeating this process until the deletion of any regulatory gene would cause L_i to increase. For comparison, we also implement an MDL inference algorithm by directly searching the combinations of 1, 2, or 3 possible regulators for the one with minimal coding length L_i. The free parameter τ in Equation 6 is set to 0.2.
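The MDL-based filter amounts to a greedy backward deletion over the parent set of each target gene. The sketch below is one way to write it, assuming empirical conditional frequencies and base-2 logarithms in the data-coding length, and assuming that a deletion which leaves L_i unchanged is not carried out. The names `coding_length` and `mdl_filter` are hypothetical.

```python
from collections import Counter
from math import ceil, log2

def coding_length(series, target, parents, tau=0.2):
    """Per-gene L_i = L_M + L_D (Eqs. 6 and 7) with empirical probabilities."""
    n, m = len(series[0]), len(series)
    k = len(parents)
    model_len = tau * (ceil(log2(n)) * k + ceil(log2(m)) * 2 ** k)
    pairs = [(tuple(s[j] for j in parents), nxt[target])
             for s, nxt in zip(series, series[1:])]
    joint, marginal = Counter(pairs), Counter(x for x, _ in pairs)
    data_len = -sum(c * log2(c / marginal[x]) for (x, _), c in joint.items())
    return model_len + data_len

def mdl_filter(series, target, parents, tau=0.2):
    """Drop, one at a time, the regulator whose removal shortens L_i the most;
    stop as soon as no deletion shortens it."""
    parents = tuple(parents)
    current = coding_length(series, target, parents, tau)
    while parents:
        trials = [(coding_length(series, target,
                                 tuple(p for p in parents if p != g), tau), g)
                  for g in parents]
        best_len, victim = min(trials)
        if best_len >= current:   # assumption: ties keep the regulator
            break
        parents = tuple(p for p in parents if p != victim)
        current = best_len
    return parents
```

Starting from a best-fit parent set that contains a redundant regulator, the filter removes it because the model-coding length roughly halves while the data-coding length is unchanged.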

We have analyzed CMI- and MDL-based filtering by using both synthetic networks and the well-studied cell-cycle model of budding yeast. We compare the inferred networks with the ground-truth network according to the following two distances [15],[23]:

(1) The normalized-edge Hamming distance:

μ_{ham}^{e} = \frac{FN + FP}{P},

(8)

where FN and FP represent the number of false-negative and false-positive wires, respectively, and P represents the total number of positive wires. This Hamming distance reflects the accuracy of the recovered regulatory relationships.

(2) The steady-state distribution distance:

μ^{ssd} = \sum_{k = 1}^{2^{n}} |π_{k} - π_{k}^{'}|,

(9)

where π_k and $π_{k}^{'}$ are the steady-state probabilities of state x_k in the ground-truth and inferred networks, respectively, and the sum runs over all 2ⁿ network states. The steady-state distribution distance reflects the degree to which an inferred network approximates the long-run behavior of the ground-truth network.
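Both distances are simple to compute once the edge sets and steady-state distributions are available. The sketch below assumes that edges are represented as (regulator, target) pairs and that the two distributions list the states in the same order; the function names are hypothetical.

```python
def edge_hamming_distance(true_edges, inferred_edges):
    """Normalized-edge Hamming distance (FN + FP) / P of Equation 8."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    fn = len(true_edges - inferred_edges)   # true wires that were missed
    fp = len(inferred_edges - true_edges)   # inferred wires that are not true
    return (fn + fp) / len(true_edges)

def steady_state_distance(pi_true, pi_inferred):
    """Sum of absolute differences over all states (Equation 9)."""
    if len(pi_true) != len(pi_inferred):
        raise ValueError("distributions must cover the same state space")
    return sum(abs(p - q) for p, q in zip(pi_true, pi_inferred))
```

Note that the Hamming distance can exceed 1 when an inferred network contains many false positives, because the normalization uses only the number of true wires.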

4 Results and discussion

4.1 Simulation on synthetic networks

We generated 1,000 random networks with n = 10 genes and, for each network, generated random samples of m = 10, 20, 30, 40, and 50 time points. Because it is hard to obtain a single time series of the required length, we adopted the following sampling strategy: (1) select several start states that are the farthest from their attractors; (2) run each start state to its attractor; (3) select one path as a time series and, if it is shorter than required, append further paths until the required number of time points is reached. We added 5% and 10% noise to these samples to investigate the effect of noise. The perturbation probability used to calculate the steady-state distribution was set to p = 0.0001. Table 1 lists the average number of true-positive and false-positive connections for the various noise intensities. Figure 1 shows the average performance of MDL, best-fit-I, and best-fit-II filtered by CMI and MDL for 0%, 5%, and 10% noise. On the whole, the performance of these algorithms improves as the sample size increases from 10 to 50, as expected: the more data we have, the better the inferred results.
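The steady-state distribution of a Boolean network with perturbation can be obtained from the transition matrix of its Markov chain. The sketch below assumes the commonly used perturbation model in which each gene flips independently with probability p and the network follows its Boolean functions only when no gene flips; the text does not state the exact model used, so this is an illustration rather than the authors' procedure. The names `transition_matrix` and `steady_state` are hypothetical, and power iteration is used only for simplicity.

```python
from itertools import product

def transition_matrix(step, n, p):
    """Markov chain of an n-gene Boolean network with perturbation.

    step: function mapping a state tuple to the next state tuple.
    """
    states = list(product((0, 1), repeat=n))
    index = {s: i for i, s in enumerate(states)}
    size = len(states)
    P = [[0.0] * size for _ in range(size)]
    for s in states:
        i = index[s]
        P[i][index[step(s)]] += (1 - p) ** n           # no gene is perturbed
        for t in states:
            h = sum(a != b for a, b in zip(s, t))      # number of flipped genes
            if h:
                P[i][index[t]] += p ** h * (1 - p) ** (n - h)
    return states, P

def steady_state(P, iters=10000, tol=1e-12):
    """Power iteration pi <- pi P, starting from the uniform distribution."""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(size)) for j in range(size)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi
```

With a perturbation probability as small as p = 0.0001, power iteration converges slowly, so solving the linear system for the stationary vector directly would likely be preferable for n = 10; the loop above is kept only because it is easy to follow.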

Table 1 Average number of true-positive and false-positive connections for MDL, best-fit-I, and best-fit-II filtered by CMI and MDL

Examination of these results reveals several trends. First, MDL-based filtering (dashed lines in Figure 1) always performs better than CMI-based filtering (dotted lines in Figure 1). MDL-based filtering aims to reduce the redundancy of a model according to the MDL principle, whereas CMI-based filtering attains reduction by blindly checking whether the CMI of a connection conditioned on each of the other genes is below a given threshold. The results indicate that the former approach is superior to the latter. According to Table 1, on the whole, MDL-based filtering retains more true connections and deletes more false connections than CMI-based filtering.

Second, the performances of MDL, best-fit-I, and best-fit-II are very similar for noiseless data. In this case, the MDL algorithm gives a model with L_D = 0, which also corresponds to the zero-error model obtained by best-fit-I. In addition, MDL-based filtering results in little improvement over the best-fit algorithms. When the data are noisy, however, their relative performance depends strongly on sample size. Specifically, for sample sizes smaller than 30, MDL performs better than best-fit-I and best-fit-II in terms of the average normalized-edge Hamming distance $μ_{ham}^{e}$. For sample sizes larger than 30, MDL performs worse than best-fit-I and best-fit-II, because the structural regularization of MDL is beneficial only for small sample sizes, whereas it leads to overfitting for large sample sizes. From Table 1, we see that, compared with best-fit-I and best-fit-II, the rate of false positives is relatively low for MDL with small sample sizes and relatively high for MDL with large sample sizes. Concerning the steady-state distribution distance μ^ssd, MDL performs better than best-fit-I and best-fit-II for data with 5% noise, but the performance of these algorithms becomes equivalent for data with 10% noise. This result may be due to the noise degrading not only the inference of the regulatory relationships but also the inferred Boolean functions, which strongly influence μ^ssd.

Third, for noisy situations, based on $μ_{ham}^{e}$ and μ^ssd, not only does MDL-based filtering not degrade performance, it improves the performance of best-fit-I and best-fit-II, with the performance of best-fit-II being slightly better than that of best-fit-I. One reason for this result may be that best-fit-II infers more true-positive connections and fewer false-positive connections in small-sample situations (see Table 1). It is interesting that, in noisy situations, MDL-based filtering can even outperform the MDL algorithm across all sample sizes. In essence, the two methods are quite different: the former aims to reduce the structural redundancy of the minimal-error model obtained by the best-fit algorithm, whereas the latter searches for the model with the minimum coding length L. From the point of view of the MDL principle, the coding length L of the model produced by MDL-based filtering may not be the minimum length. Because MDL-based filtering combines the best-fit algorithm with the MDL principle, it reduces structural redundancy and overcomes overfitting in large-sample situations.

4.2 Cell cycle model of budding yeast

The cell cycle is a vital biological process in which one cell grows and divides into two daughter cells. It consists of four phases, G1, S, G2, and M, and is regulated by a highly complex network that is highly conserved among the eukaryotes. From the 800 genes involved in the cell cycle process of budding yeast, Li et al. constructed a network of 11 key regulators: Cln3, MBF, SBF, Cln1, Cdh1, Swi5, Cdc20, Clb5, Sic1, Clb1, and Mcm1 [24]. This Boolean network model, shown in Figure 2A, has an attractor whose biggest basin corresponds to the biological G1 stationary state. The temporal sequence in Table 2 is a pathway from this basin that follows the biological trajectory of the cell cycle network.

Table 2 Temporal evolution of state for the cell cycle

We applied MDL, best-fit-I, and best-fit-II filtered by CMI and MDL to the artificial time-series data in Table 2. The inferred networks are shown in Figure 2. Figure 2B shows the network inferred by the MDL algorithm, which is the best network. The networks in Figure 2C,D have the same number of true-positive connections, with the latter having fewer false-positive connections. This result demonstrates that the method of selecting regulatory genes in best-fit-II is superior to that of best-fit-I. Compared with the networks in Figure 2E,F, which were obtained by CMI-based filtering of those in Figure 2C,D, the networks in Figure 2G,H, obtained by MDL-based filtering, have more true connections, whereas the number of false-positive connections is about the same. Furthermore, the networks resulting from CMI-based filtering have two disconnected subgraphs, whereas the network resulting from MDL-based filtering is a connected graph. This result shows that MDL-based filtering is more effective than CMI-based filtering. In fact, Figure 2G shows the same result as Figure 2B, which is the best result.

We also ran 100 simulations with 5% and 10% noise for the pathway under consideration. Table 3 lists the average number of true positives and false positives, the normalized-edge Hamming distance $μ_{ham}^{e}$, and the steady-state distribution distance μ^ssd. The results are consistent with those of the simulated networks (Figure 1), and they demonstrate that MDL-based filtering is effective for samples containing a small amount of noise.

Table 3 Comparison of MDL, best-fit-I, and best-fit-II with CMI- and MDL-based filtering for yeast-pathway data

5 Conclusion

Reducing the rate of false positives is an important issue in network inference. In this paper, we address this issue by using the minimum description length (MDL) principle. Specifically, we apply the MDL measurement technique proposed by Zhao et al. to filter the models obtained by two best-fit algorithms (best-fit-I and best-fit-II). We compare the performance of MDL, best-fit-I, and best-fit-II filtered by CMI and MDL both on simulated networks and on an artificial model of budding yeast. The results show that, as determined by the distance metrics $μ_{ham}^{e}$ and μ^ssd, MDL-based filtering does not degrade inference performance, can improve inference performance, and is more effective than CMI-based filtering. Moreover, the combination of MDL filtering with the best-fit algorithm can even outperform the MDL algorithm alone. Additionally, applying MDL-based filtering is computationally less burdensome than using the MDL algorithm alone, because calculating the data-coding length L_D is more complex than calculating the error estimate of the best-fit algorithm, and the complexity of the calculation increases dramatically as the sample size m increases. Last but not least, MDL-based filtering can also be applied to the results of other minimal-error algorithms such as CoD.

References

[1] Shmulevich I, Dougherty ER: Genomic Signal Processing (Princeton Series in Applied Mathematics). Princeton University Press, Princeton; 2007.
[2] Shmulevich I, Dougherty ER: Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. SIAM, Philadelphia; 2010.
[3] Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing. World Scientific, Singapore; 1998.
[4] Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7: S7.
[5] Zhao W, Serpedin E, Dougherty ER: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22: 2129-2135. 10.1093/bioinformatics/btl364
[6] Chaitankar V, Ghosh P, Perkins EJ, Gong P, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst. Biol. 2010, 4: S7. 10.1186/1752-0509-4-S1-S7
[7] Chaitankar V, Zhang C, Ghosh P, Perkins EJ, Gong P, Deng Y: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS'09); 2009: 487-490.
[8] Dougherty J, Tabus I, Astola J: Inference of gene regulatory networks based on a universal minimum description length. EURASIP J. Bioinform. Syst. Biol. 2008, 2008: 482090.
[9] Tabus I, Astola J: On the use of MDL principle in gene expression prediction. EURASIP J. Appl. Signal Proc. 2001, 2001: 297-303. 10.1155/S1110865701000270
[10] Dougherty ER, Kim S, Chen Y: Coefficient of determination in nonlinear signal processing. Signal Process. 2000, 80: 2219-2235. 10.1016/S0165-1684(00)00079-7
[11] Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J. Biomed. Opt. 2000, 5: 411-424. 10.1117/1.1289142
[12] Shmulevich I, Saarinen A, Yli-Harja O, Astola J: Inference of genetic regulatory networks via best-fit extensions. In Computational and Statistical Approaches to Genomics. Springer, US; 2002.
[13] Lähdesmäki H, Shmulevich I, Yli-Harja O: On learning gene regulatory networks under the Boolean network model. Mach. Learn. 2003, 52: 147-167. 10.1023/A:1023905711304
[14] Zhao W, Serpedin E, Dougherty ER: Inferring connectivity of genetic regulatory networks using information-theoretic criteria. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008, 5(2): 262-274. 10.1109/TCBB.2007.1067
[15] Qian X, Dougherty ER: Validation of gene regulatory network inference based on controllability. Front. Genet. 2013, 4: 272. 10.3389/fgene.2013.00272
[16] Dougherty ER, Pal R, Qian X, Bittner ML, Datta A: Stationary and structural control in gene regulatory networks: basic concepts. Int. J. Syst. Sci. 2010, 41(1): 5-16. 10.1080/00207720903144560
[17] Yousefi MR, Dougherty ER: Intervention in gene regulatory networks with maximal phenotype alteration. Bioinformatics 2013, 29(14): 1758-1767. 10.1093/bioinformatics/btt242
[18] Ivanov I, Simeonov P, Ghaffari N, Qian X, Dougherty ER: Selection policy induced reduction mappings for boolean networks. IEEE Trans. Signal Process. 2010, 58(9): 4871-4882. 10.1109/TSP.2010.2050314
[19] Ghaffari N, Ivanov I, Qian X, Dougherty ER: A CoD-based reduction algorithm for designing stationary control policies on Boolean networks. Bioinformatics 2010, 26: 1556-1563. 10.1093/bioinformatics/btq225
[20] Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the boolean network model. Pac. Symp. Biocomput. 1999, 4: 17-28.
[21] Boros E, Ibaraki T, Makino K: Error-free and best-fit extensions of partially defined boolean functions. Inf. Comput. 1998, 140: 254-283. 10.1006/inco.1997.2687
[22] Rissanen J: Modeling by shortest data description. Automatica 1978, 14: 465-471. 10.1016/0005-1098(78)90005-5
[23] Dougherty ER: Validation of gene regulatory networks: scientific and inferential. Brief. Bioinform. 2011, 12: 245-252. 10.1093/bib/bbq078
[24] Li F, Long T, Ying L, Ouyang Q, Tang C: The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 2004, 101: 4781-4786. 10.1073/pnas.0305937101

Acknowledgements

This work was funded in part by the National Science Foundation of China (Grants No. 61272018, No. 60970065, and No. 61174162) and the Zhejiang Provincial Natural Science Foundation of China (Grants No. R1110261 and No. LY13F010007) and support from China Scholarship Council.

Author information

Authors and Affiliations

Department of Physics and Electronic information engineering, Wenzhou University, Wenzhou, 325035, Zhejiang, China
Jie Fang, Hongjia Ouyang, Liangzhong Shen & Wenbin Liu
Department of Electrical and Computer Engineering, Texas A&M University, College Station, 33101, TX, USA
Edward R Dougherty & Wenbin Liu
Center for Bioinformatics and Genomics Systems, College Station, 33101, TX, USA
Edward R Dougherty

Corresponding author

Correspondence to Wenbin Liu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Fang, J., Ouyang, H., Shen, L. et al. Using the minimum description length principle to reduce the rate of false positives of best-fit algorithms. J Bioinform Sys Biology 2014, 13 (2014). https://doi.org/10.1186/s13637-014-0013-2

Received: 06 January 2014
Accepted: 14 June 2014
Published: 03 July 2014
DOI: https://doi.org/10.1186/s13637-014-0013-2
