Relations between the set-complexity and the structure of graphs and their subgraphs
Tomasz M Ignac^{1, 2}, Nikita A Sakhanenko^{1} and David J Galas^{1, 2}
DOI: 10.1186/1687-4153-2012-13
© Ignac et al.; licensee Springer. 2012
Received: 22 December 2011
Accepted: 13 June 2012
Published: 21 September 2012
Abstract
We describe some new conceptual tools for the rigorous, mathematical description of the “set-complexity” of graphs. This set-complexity has been shown previously to be a useful measure for analyzing some biological networks, and in discussing biological information in a quantitative fashion. The advances described here allow us to define some significant relationships between the set-complexity measure and the structure of graphs, and of their component subgraphs. We show here that modular graph structures tend to maximize the set-complexity of graphs. We point out the relationship between modularity and redundancy, and discuss the significance of set-complexity in this regard. We specifically discuss the relationship between complexity and entropy in the case of complete bipartite graphs, and present a new method for constructing highly complex binary graphs. These results can be extended to the case of ternary graphs, and to other multi-edge graphs, which are fundamentally more relevant to biological structures and systems. Finally, our results lead us to an approach for extracting high-complexity modular graphs from large, noisy graphs with low information content. We illustrate this approach with two examples.
Keywords
Set-complexity; Biological networks; Modularity; Modular graphs; Bipartite graphs; Multipartite graphs
Introduction
Most physical, communications, social, and biological networks are usefully represented as graphs, with varying levels of complexity. The topology and the statistical structures of these graphs are central to understanding the functional properties of these systems. Our primary concern here is the representation and properties of biological networks, as reflected in the graphs used to represent these complex systems. The application of our results, however, is significantly broader. Previous attempts to elucidate the fundamental concept of biological information have led to a proposed, general measure of complexity, or information content, based on Kolmogorov complexity [1, 2], that resolves some of the perplexing paradoxes of biologically relevant meaning that arise in definitions of information and complexity [1]. We used this approach successfully in analyzing the information in gene interaction networks of yeast [3, 4]. It was shown that the most informative networks are those with the highest set-complexity (a detailed discussion of applications of the set-complexity to biology and related problems can be found in the cited articles). The properties of our measure, which we call “set-complexity”, are expected to be fruitful in describing a large class of problems in biology. It is clear, however, that we need more mathematical understanding of the properties of this complexity measure, and we have therefore focused initially on the set-complexity of graphs, and begun by analyzing the mathematical properties of relatively simple structures.
The results here extend our previous results and increase understanding of the structure of graphs and subgraphs with the highest set-complexity. We have previously suggested, for example, that highly complex graphs have a more modular architecture than others [4]. The aim of this article is twofold. First, we aim to provide a mathematical foundation for this suggestion, that is, for the relation between the set-complexity and the graph structure. Second, we show that this research has practical uses. To accomplish the first goal we develop a formalism that allows us to analyze the set-complexity in a rigorous fashion and capture some of its essential properties. Our approach uses stochastic methods to analyze graphs by defining specific random variables describing interactions between nodes in a graph. Information-theoretic features of these variables are then used to investigate the set-complexity measure, Ψ. To accomplish the second goal, we present two examples illustrating how the set-complexity theory can be used to identify specific subgraphs with modular properties. Note that the theoretical formalism of this article extends the ideas from our previous article [5], which presented the technical background of set-complexity and its computation as well as an initial analysis of the complexity of some graphs. Article [5] does not address the application of this formalism to finding modular structure in real-world networks, which is a major goal of this article.
The article is structured as follows. First, we describe basic definitions and notation, and present the relation between the complexity and the entropy for complete bipartite graphs (CBG), an important class of binary graph for this analysis. We then describe a method for constructing highly complex binary graphs and provide two examples which show how to use the set-complexity to analyze the information content of a graph and its subgraphs. We conclude the article by discussing results, open questions and plans for future work.
Preliminaries
Let G=(V,E) denote a graph, where V stands for the set of vertices and E the set of edges. The number of nodes in a graph is denoted by N, i.e., V={1,…,N}. Existence of an edge between nodes i and j is denoted by (i,j)∈E, and M labels for the graph edges are assumed. The labels are enumerated from 0 to M−1. Let us take a∈{0,…,M−1}. The notation (i,j)=a states that the label of the edge connecting nodes i and j is equal to a. We also assume that the graphs are fully connected in the following sense. A graph can always be formally extended to a multi-labeled, fully connected graph by defining an edge label 0, the usual designation for no connection. For example, in binary graphs, which are the main subject of this article, (i,j)=1 means that nodes i and j are connected and (i,j)=0 stands for a pair of disconnected nodes.
For each node i∈V we define the probability distribution P_{ i }(a), which is the fraction of nodes connected to node i by edges labeled a. In other words, if we choose a particular i and then randomly select another node, j, from the remaining N−1 nodes, the value of P_{ i }(a) is the probability of (i,j)=a. In a binary graph, P_{ i }(1) is the number of nodes connected to node i divided by N−1.
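The definition of P_{ i }(a) can be sketched in a few lines of code. The function name `label_distribution` and the list-of-lists label matrix are illustrative choices, not part of the original formalism; nodes are indexed from 0 here for convenience.

```python
from collections import Counter


def label_distribution(labels, i, num_labels):
    """Return [P_i(0), ..., P_i(M-1)] for node i.

    `labels` is a symmetric N x N list of lists with labels[i][j] in
    {0, ..., M-1}; the diagonal entry labels[i][i] is ignored.
    """
    n = len(labels)
    counts = Counter(labels[i][j] for j in range(n) if j != i)
    # Each of the remaining N-1 nodes is selected with equal probability.
    return [counts[a] / (n - 1) for a in range(num_labels)]


# Binary path graph 0 - 1 - 2 - 3 written as a label matrix.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

print(label_distribution(path, 1, 2))
```

For node 1 of this path, two of the three other nodes are neighbours, so P_{ 1 }(1) is the number of connected nodes divided by N−1, as stated above.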
Remark 1. P_{ i }(a) and P_{ ij }(a,·) are two probability distributions of random variables defined on the same alphabet {0,…,M−1}. The difference between these two quantities is small and tends to zero as N goes to infinity. P_{ i }(a) describes a situation when only one node is selected, and we randomly choose another node. P_{ ij }(a,·) describes a situation when we are given a pair of nodes and a third node is chosen at random. The value of the random variable is the label of the edge between i and the selected node.
We previously introduced the definition of mutual information for graphs [1]. Intuitively, it measures the reduction of the uncertainty about the connectivity of one node given the connectivity pattern of a second node. It is therefore natural to define this quantity as the mutual information between the random variables described by the distributions P_{ ij }(a,·) and P_{ ij }(·,b), cf. Remark 1.
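A minimal sketch of this quantity follows. It assumes that P_{ ij }(a,b) is the fraction of third nodes k (different from i and j) with (i,k)=a and (j,k)=b, and that logarithms are taken in base M so that the result lies between 0 and 1. The function names are illustrative, and both assumptions are labeled in the code.

```python
import math


def joint_distribution(labels, i, j, num_labels):
    """Assumed form of P_ij(a, b): fraction of third nodes k with
    (i,k) = a and (j,k) = b, over the N-2 nodes other than i and j."""
    n = len(labels)
    joint = [[0.0] * num_labels for _ in range(num_labels)]
    for k in range(n):
        if k != i and k != j:
            joint[labels[i][k]][labels[j][k]] += 1.0 / (n - 2)
    return joint


def mutual_information(joint):
    """Mutual information between the marginals P_ij(a, .) and P_ij(., b).

    Logarithms use base M (an assumption), so the value is in [0, 1].
    """
    m = len(joint)
    row = [sum(joint[a]) for a in range(m)]
    col = [sum(joint[a][b] for a in range(m)) for b in range(m)]
    total = 0.0
    for a in range(m):
        for b in range(m):
            p = joint[a][b]
            if p > 0.0:
                total += p * math.log(p / (row[a] * col[b]), m)
    return total


# K_{2,4}: nodes 0-1 form one side, nodes 2-5 the other.
k24 = [[1 if (i < 2) != (j < 2) else 0 for j in range(6)] for i in range(6)]

print(mutual_information(joint_distribution(k24, 2, 3, 2)))
```

In this small example nodes 2 and 3 have identical connectivity, so knowing the pattern of one removes all uncertainty about the other.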
Complexity of CBGs
The set of nodes in a CBG can be represented as a union of two disjoint sets O_{1} and O_{2} such that if nodes i and j belong to different sets, then (i,j)=1, and if they belong to the same set, then (i,j)=0. Sets O_{1} and O_{2} are referred to as orbits. This is consistent with the graph theory definition of an orbit, which holds that an orbit is an equivalence class of nodes under the action of an automorphism [7]. This means that all nodes in an orbit are connected in the same way to other nodes. The symbol K_{m,N−m} is used to denote a CBG of size N, where m is the size of O_{1}.
Consider nodes i and j from the same orbit. By the definition of CBGs, (i,k)=(k,j) for any third node k. Thus, P_{ ij }(0,1)=P_{ ij }(1,0)=0. Consequently, P_{ ij }(0,0)=P_{ ij }(0,·)=P_{ ij }(·,0) and P_{ ij }(1,1)=P_{ ij }(1,·)=P_{ ij }(·,1). This leads us to P_{ ij }(0∣0)=P_{ ij }(1∣1)=1 and P_{ ij }(0∣1)=P_{ ij }(1∣0)=0. Similar reasoning holds for nodes from different orbits, for which P_{ ij }(a∣a)=0 and P_{ ij }(a∣b)=1 for a≠b. If we apply this result to Equation (6), we can see that the second component of the sum on the right hand side of the equation is zero. Therefore, we have proved the following lemma.
where q = m/N. A similar analysis shows that Equation (8) can also be used to approximate entropies when i,j∈O_{2} or i∈O_{1}, j∈O_{2}. The notation H(q) emphasizes that this quantity depends only on the proportion of nodes in orbits O_{1} and O_{2} and does not depend on the size of the graph.
Theorem 1. Let G_{ N } be a sequence of complete bipartite graphs, such that the ratio q=m/N is constant for all N. Then, $\lim_{N\to \infty}\Psi \left({G}_{N}\right)=4\left({H}^{2}(q)-{H}^{3}(q)\right)$.
□
Note that the sum on the right hand side of Equation (10) consists of N(N−1)/2 identical elements. Thus, Equation (10) can be rewritten to the equation of the theorem. QED.
Figure 1 shows that CBGs with low values of q have complexity that is very close to the upper bound. The complexity of CBGs with high node entropies tends to zero (while the upper bound rises at the same time). This suggests a method for constructing complex graphs from CBGs.
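The limit in Theorem 1 is easy to evaluate numerically. The sketch below assumes that H(q) is the binary entropy of (q, 1−q) measured in bits, which is consistent with the statement that it depends only on the proportion of nodes in the two orbits. The function names are illustrative.

```python
import math


def binary_entropy(q):
    """Entropy in bits of the two-outcome distribution (q, 1 - q)."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def cbg_limit_complexity(q):
    """Limit of Psi for K_{m,N-m} with q = m/N fixed: 4 (H^2(q) - H^3(q))."""
    h = binary_entropy(q)
    return 4.0 * (h ** 2 - h ** 3)


for q in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(f"q={q:.2f}  H(q)={binary_entropy(q):.4f}  "
          f"limit={cbg_limit_complexity(q):.4f}")
```

Under this formula the limit vanishes at q=1/2, where H(q)=1, in agreement with the observation that the complexity of CBGs with high node entropies tends to zero. As a function of H alone, 4(H^2−H^3) is largest at H=2/3, where it equals 16/27.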
Complex binary graphs
At the end of the last section we showed that graphs with high values of Ψ (close to one) should exhibit high values of node entropies, similar to K_{N/2,N/2} graphs. This section shows that, even though K_{N/2,N/2} graphs have zero complexity, they are a good starting point for constructing highly complex graphs, in that a relatively small number of modifications is needed to increase Ψ substantially. We propose a stochastic transformation F_{ p } of a graph such that for any pair of nodes i and j the label of (i,j) is flipped to the opposite value with probability p. We use G^{∗} to denote the graph produced by this transformation applied to G.
where E[·] stands for the expected value. A similar analysis conducted for nodes from different orbits reveals that E[P_{ ij }(a,a)]=p(1−p) and E[P_{ ij }(a,b)]=1/2−p(1−p), where a≠b.
We see that the expected value of the node entropies remains one, i.e., the transformation preserves the entropy of nodes in K_{N/2,N/2} graphs, but it alters the mutual information m_{ ij }. The complexity is maximized when m_{ ij }=1/2. Since the node entropies are close to one, it follows from Equation (7) that m_{ ij }=1/2 when H_{ ij }=3/2. We can calculate that for the transformation F_{ p }, E[H_{ ij }]=3/2 iff p≈0.058428. This discussion can be summarized in the following theorem.
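The value of p can be checked numerically. The sketch below takes the expected joint distribution for nodes from different orbits given above, with weights p(1−p) on the two equal-label cells and 1/2−p(1−p) on the two unequal-label cells, and computes its entropy in bits. With u=2p(1−p) this entropy simplifies to 1+h(u), where h is the binary entropy. Using the entropy of the expected distribution in place of E[H_{ ij }] is an approximation assumed here for large N, and the function names are illustrative.

```python
import math


def binary_entropy(u):
    """Entropy in bits of the two-outcome distribution (u, 1 - u)."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return -u * math.log2(u) - (1.0 - u) * math.log2(1.0 - u)


def expected_joint_entropy(p):
    """Entropy of the expected joint distribution after applying F_p.

    The four cells are (x, x, 1/2 - x, 1/2 - x) with x = p (1 - p),
    whose entropy equals 1 + h(2x).
    """
    u = 2.0 * p * (1.0 - p)
    return 1.0 + binary_entropy(u)


def solve_flip_probability(target=1.5, tol=1e-12):
    """Bisection on [0, 1/2], where the entropy increases from 1 to 2."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_joint_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(solve_flip_probability())
```

Bisection is sufficient because 2p(1−p) increases with p on [0, 1/2] and the binary entropy increases on the same range, so the target value 3/2 is crossed exactly once.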
Theorem 2. Let G_{ N } be a sequence of K_{N/2,N/2} graphs, and let ${G}_{N}^{\ast}$ be a sequence of corresponding outputs of the transformation F_{ p } with p≈0.058428. Then, $\lim_{N\to \infty}E\left[\Psi \left({G}_{N}^{\ast}\right)\right]=1$.
To illustrate this theorem experimentally, we applied the transformation to K_{N/2,N/2} graphs with N=50, 100, 200, 300, 500 nodes. The average values of Ψ(G^{∗}) ranged from 0.9154 (with standard deviation 0.0185 on 500 experiments) for N=50 to 0.9926 (with standard deviation 0.0004 on 50 experiments) for N=500.
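An experiment of this kind can be sketched as follows. The code is not the authors' implementation, and several forms are assumptions chosen to agree with Theorem 1 and with the description of Ψ as an average over pairs: the node entropy K_{ i } is the entropy of P_{ i }, the mutual information is m_{ ij }=H_{ i }+H_{ j }−H_{ ij } computed from the joint distribution over third nodes, each pair contributes 4·max(K_{ i },K_{ j })·m_{ ij }(1−m_{ ij }), and logarithms use base M. All function names are illustrative.

```python
import math
import random


def entropy(probs, base):
    """Shannon entropy of a probability vector, in the given log base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def complete_bipartite(m, n):
    """Label matrix of K_{m,n-m}: nodes 0..m-1 form the first orbit."""
    return [[1 if (i < m) != (j < m) else 0 for j in range(n)]
            for i in range(n)]


def flip_edges(labels, p, rng):
    """Transformation F_p: flip each pair's binary label with probability p."""
    n = len(labels)
    out = [row[:] for row in labels]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                out[i][j] = out[j][i] = 1 - labels[i][j]
    return out


def set_complexity(labels, num_labels=2):
    """Assumed form of Psi: mean over pairs of 4 max(K_i,K_j) m_ij (1-m_ij)."""
    n = len(labels)
    node_h = []
    for i in range(n):
        counts = [0] * num_labels
        for j in range(n):
            if j != i:
                counts[labels[i][j]] += 1
        node_h.append(entropy([c / (n - 1) for c in counts], num_labels))
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            joint = [[0] * num_labels for _ in range(num_labels)]
            for k in range(n):
                if k != i and k != j:
                    joint[labels[i][k]][labels[j][k]] += 1
            cells = [c / (n - 2) for row in joint for c in row]
            rows = [sum(row) / (n - 2) for row in joint]
            cols = [sum(joint[a][b] for a in range(num_labels)) / (n - 2)
                    for b in range(num_labels)]
            m_ij = (entropy(rows, num_labels) + entropy(cols, num_labels)
                    - entropy(cells, num_labels))
            total += 4.0 * max(node_h[i], node_h[j]) * m_ij * (1.0 - m_ij)
    return total / (n * (n - 1) / 2)


rng = random.Random(0)  # fixed seed so that the run is repeatable
base = complete_bipartite(25, 50)
print("Psi(K_25,25)   =", set_complexity(base))
print("Psi after F_p  =", set_complexity(flip_edges(base, 0.058428, rng)))
```

A single run at N=50 is only indicative; averaging over many independent seeds, as in the experiments reported above, gives a more stable estimate.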
Applications
It is obvious that Ψ can be expressed as the average of ϕ_{ ij }.
One way of extending the analysis of a graph may be described as a problem similar to retrieving a signal from a noisy transmission of information. Here, the signal is a subgraph showing some type of regular structure, e.g., a set of nodes with similar connectivity pattern, and the noise comes from all the nodes that do not exhibit any regular connectivity patterns, such as the nodes of a random graph. Structures like this arise in biology whenever we locate members of a large set of objects based on some common properties, for example, when we select genes based on their correlated expression levels. In contrast to [8], we focus our attention on graphs with very low values of the complexity score. Low complexity graphs can have different characters: some of them may be simple random graphs, while others can have a very regular structure, like CBGs. Both of these types of graphs are uncommon in biological applications. On one hand, biological systems are not random; thus, characteristics of their network representations cannot exhibit values similar to those of randomly generated graphs. On the other hand, such graphs are not completely regular. In biological sciences we almost always deal with an interesting mixture of randomness and regularity. We will focus our attention here on graphs whose structure is a mix of random and regular connectivity patterns.
where T is a threshold for values of ϕ_{ ij } and 〈·〉 stands for the Iverson bracket, i.e., a logic function that takes value 1 if the statement inside the bracket is true, and 0 otherwise. In summary, for a specific i, Φ_{ i }(T) is the number of pairs (i,j) in the graph such that ϕ_{ ij }>T. By looking at the rightmost tail of the histogram of Φ_{ i }(T) we can identify nodes with the highest contribution to Ψ.
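The counting step can be sketched directly from this description. The function name `high_contribution_counts` and the small matrix of ϕ_{ ij } values are made up for illustration only.

```python
def high_contribution_counts(phi, threshold):
    """Return [Phi_0(T), ..., Phi_{N-1}(T)].

    `phi` is a symmetric N x N list of lists of pairwise contributions;
    Phi_i(T) counts the nodes j != i with phi[i][j] > T.
    """
    n = len(phi)
    return [
        sum(1 for j in range(n) if j != i and phi[i][j] > threshold)
        for i in range(n)
    ]


# Hypothetical pairwise contributions for a four-node graph.
phi = [
    [0.00, 0.30, 0.02, 0.01],
    [0.30, 0.00, 0.08, 0.03],
    [0.02, 0.08, 0.00, 0.04],
    [0.01, 0.03, 0.04, 0.00],
]

print(high_contribution_counts(phi, 0.05))  # prints [1, 2, 1, 0]
```

Sorting or histogramming the resulting list then exposes the nodes in the right tail, here node 1.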
We now present two examples. The first one is an artificially generated graph and the second is based on a biological data set. The two examples are followed by a discussion of the proposed approach: its relation to community detection, modularity of networks and data sets, possible applications and plans for future work. We want to stress that the purpose of this discussion is to show that the set-complexity of a graph, and its components ϕ_{ ij }, give us an insight into the graph’s structural properties. Nevertheless, this approach may also be interesting for analyzing real biological data.
Example 1: artificially generated graph
In the first example we use a 300-node graph consisting of two subgraphs. The first subgraph is a K_{25,25} graph and the second is a random graph in which the probability that two nodes are connected is 1/2. The two subgraphs are also randomly connected to each other: the probability of an edge between a pair of nodes from different subgraphs is 1/2. The second example, presented later, is based on real biological data.
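A graph of this kind can be generated as in the following sketch. The node ordering (the K_{25,25} subgraph occupying the first 50 indices), the seed, and the function name are illustrative assumptions, so the resulting graph is not the exact instance analyzed here.

```python
import random


def example_graph(n=300, half=25, edge_prob=0.5, seed=0):
    """Binary label matrix: K_{half,half} on the first 2*half nodes,
    and independent edges with probability edge_prob everywhere else."""
    rng = random.Random(seed)
    core = 2 * half
    labels = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j < core:
                # Both nodes lie in the CBG: connect different orbits only.
                value = 1 if (i < half) != (j < half) else 0
            else:
                # At least one node lies in the random subgraph.
                value = 1 if rng.random() < edge_prob else 0
            labels[i][j] = labels[j][i] = value
    return labels


graph = example_graph()
random_edges = sum(graph[i][j] for i in range(300) for j in range(max(i + 1, 50), 300))
print("edges outside the CBG:", random_edges)
```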
The graph overall exhibits a very low value of Ψ, relative to most CBGs, about 0.011. Low complexity indicates, in this case, a graph with a high number of randomly connected nodes. On the other hand, a low Ψ graph can be characteristic of a very regular graph structure. Looking at mutual information simply allows us to distinguish between a very regular and a very random graph. In the present example mutual information is low: its mean value is about 0.02. At the same time all node entropies are close to one. This indicates that the structure of the graph is more random than regular. Nevertheless, there is a modular subgraph in this graph.
Figure 3c shows the histogram of ϕ_{ ij }. As expected, most of these values are concentrated close to zero, and the right tail is almost invisible. Nevertheless, the right tail is present, and the comparison of ϕ_{ ij } for i=1 and i=51 indicates that nodes from the complete bipartite subgraph make stronger contributions to the tail than nodes from the random subgraph. To illustrate this we fixed the threshold, T=0.05, and calculated the number of pairs with ϕ_{ ij }>0.05, defined as Φ_{ i }(0.05). Figure 3d shows the histogram of Φ_{ i }(0.05).
Let us take a closer look at what happens when we change T. Figure 3e,f show histograms of Φ_{ i }(0.025) and Φ_{ i }(0.1), respectively. The complete bipartite subgraph can be identified in both cases; however, in the first case (T=0.025) both groups of nodes are close to one another. Decreasing T below 0.025 will result in misclassification of a significant number of nodes (mixing the two classes that are clearly separable in the present case). On the other hand, increasing T makes the group on the right flatter, so it becomes more difficult to distinguish between these groups. For example, in Figure 3f we show the histogram of Φ_{ i }(0.1), where the right group looks almost like a long tail of the group on the left.
As we can see the choice of the threshold T can be somewhat arbitrary at the outset. Our approach yields a tool for analyzing graphs. Thus, it could be used in a supervised mode, where T is specified by the user, or the threshold could be systematically scanned in an unsupervised mode.
Example 2: biological data set
We want to solve the problem of finding a set, or sets, of nodes with a similar connectivity pattern, which might represent a modular subgraph. This case is more difficult, because we do not know a priori that there is any modular structure. Consequently, we initially choose low values for T, to avoid omitting potentially relevant nodes.
Distribution of different types of edges in the 97 node subgraph of the original correlation graph
                             Type I edges                    Type II edges                   Type III edges
                             (strong positive correlation)   (strong negative correlation)
Module 1 (33 nodes)          373 (70.6%)                     25 (4.7%)                       130 (24.6%)
Module 2 (64 nodes)          1644 (81.6%)                    46 (2.3%)                       326 (16.2%)
Connections between modules  109 (5.2%)                      1405 (66.5%)                    598 (28.3%)
Conclusion
We have shown that, in general, a modular structure maximizes the set-complexity of a graph. It has been formally proved, however, that this is not always the case. If a binary graph is composed of two modules of identically connected nodes (orbits) and the modules have the same sizes, then the complexity of such a graph is almost zero. The complexity grows rapidly, however, when we perturb the graph structure by breaking this symmetry. The symmetry can be broken in two ways: either the number of nodes in the components of the CBG can be made unequal, or the complete bipartite character can be broken by adding or deleting edges [8]. In fact, the number of altered edges needed to increase Ψ significantly is relatively small, and the bimodular structure of the graph is essentially preserved in a graph with significant Ψ. Similar results can be obtained for multicolored edge graphs, with M>2 [8]. We presented a method and two examples here that suggest useful applications of the described theory to analyzing real biological data, namely finding highly informative modular subgraphs in a large graph.
There are several technical aspects of the analysis presented above that need to be considered. First, in the second example, the procedure was applied iteratively, twice. We chose a subgraph of interest and repeated the procedure on this subgraph. It is important to note that in the iterations the values of ϕ_{ ij } were recomputed for the subgraph only: the nodes and edges that are not in the subgraph are omitted from the computation. Since the set-complexity is defined as a context-dependent measure, we treat one subset of nodes as a context for the other subset. Therefore, by omitting a group of nodes we change the context for the remaining nodes and change the complexity. It is clear that the subset of nodes considered is an important part of the definition of the set-complexity.
Our examples illustrate how to use set-complexity to capture the information content of a graph. For instance, the histograms in Figure 6a,c show the increase of information when we narrow the original graph from 541 to 251 nodes. This information gain is also quantified by the set-complexity, which increases from 0.06 to 0.32. This can be useful for an evaluation of a network. Even if a network seems to be uninformative, we can attempt to extract an informative set of hidden regular patterns by narrowing down the set of nodes. This can be especially useful for networks with multiple types of edges (multicolor graphs), for which existing community detection and clustering methods are not suitable.
We wish to point out a significant potential relationship between two ideas presented here. The notion of modularity, based on the common connectedness of sets of nodes, as reflected in the measure of mutual information in the graph, is closely related to the idea of redundancy. This is because the modularity often stems from sets of nodes that are connected in similar ways to other nodes. Redundancy, in turn, has a strong functional significance in all functional systems, which is that it provides robustness against damage or loss. If there are two or more nodes that are connected in almost the same fashion, loss of one of these nodes or its connection(s) can be mitigated to some extent by having a stand-in, or partial stand-in, in another node. Clearly this is a quantitative issue that needs more attention to be fully characterized. What is also clear is that with too much redundancy, or regularity, the range of responses and the sensitivity to a variety of inputs is limited. This qualitative notion parallels the very idea of maximizing Ψ in that regularity (similar to redundancy) is balanced against variety (similar to randomness). The idea is appealing in thinking about biology, in that the robustness to perturbation or damage and the sensitivity to perturbation or damage are two general properties that biological evolution seeks to balance in many ways. It may be that Ψ can provide some quantitative insight into this biological balancing act.
Though the concept of set-complexity, defining a balance between regularity and randomness, is promising for future applications in biology, the two examples in this article are illustrations of a possible approach based on set-complexity and should be viewed as complementary to traditional community detection algorithms. At the current stage of development, the proposed approach requires supervision, but it is clear that scanning through threshold parameter space will be a key to automating the method. Since this article (as well as [8]) provides a rigorous theoretical background for the set-complexity of graphs, it should be possible to derive an automated approach for performing an analysis as illustrated in the examples. One possible direction for future research is to combine the search for a maximally complex subgraph with optimization techniques, such as hill-climbing, using stochastic sampling methods.
Another interesting extension to our work is to look at how to use set-complexity as a specific measure of the modularity of graphs and of data sets. This extension would allow us to analyze modularity of multi-labeled graphs, which is currently impossible using traditional measures of modularity, since there is no defined interpretation of modularity for graphs with various types of labels. This will be a direction for future work.
The set-complexity was originally defined as a measure of complexity of sets of binary strings [1]. This definition can easily be used for characterizing the complexity of dynamics of various types of Boolean networks (for example, random, probabilistic), in which a binary string represents a state of a network and, thus, a dynamic trajectory of a network is a set of strings [1, 10]. We have defined the set-complexity in terms of Kolmogorov complexity [1]. Unfortunately, since Kolmogorov complexity is incomputable, it needs to be approximated by algorithmic compression of binary strings, which represent states of the network. This approach has two drawbacks: (1) the approximated set-complexity is not normalized, so it is difficult to compare complexities of networks of different sizes, and (2) we can say nothing about the structure of the sequences: we can only hypothesize that these strings should be somewhat similar to one another but, in contrast to the graph case, we cannot quantify these relations. It may be interesting to calculate the complexity of a set of strings in a manner similar to that presented in the current article. We have begun this type of analysis, and the preliminary results look promising. We believe that such an approach may give us interesting insights into the dynamics and information structures of various types of Boolean networks.
We have demonstrated that the probabilistic description of the set-complexity sets up a formal framework for reasoning about some properties of our measure of complexity. We are able to prove some important properties of the set-complexity of graphs. Such an approach can be fruitful in further investigations of this subject. This may result in a better understanding of the nature of complexity in systems biology, which may play a key role from the perspective of practical applications of that theory.
Abbreviations
CBG: complete bipartite graph.
Declarations
Acknowledgements
This work was supported by the ISB-Luxembourg Program, and by the FIBR program of NSF (0527023). TI is a fellow of the Luxembourg, LCSB-ISB fellowship program. We gratefully acknowledge stimulating conversations with Greg Carter and Ilya Shmulevich at various stages of this work. We thank Marek Ostaszewski from LCSB for providing the data and Paul Shannon for generating Figure 4.
Authors’ Affiliations
References
1. Galas DJ, Nykter M, Carter GW, Price ND, Shmulevich I: Biological information as set-based complexity. IEEE Trans. Inf. Theory 2010, 56: 667–677.
2. Kolmogorov AN: Three approaches to the definition of the concept quantity of information (Russian). Probl. Peredachi Inf 1965, 1: 3–11.
3. Carter GW, Galas DJ, Galitski T: Maximal extraction of biological information from genetic interaction data. PLOS Comput. Biol 2009, 5(4): e1000347.
4. Carter GW, Rush CG, Uygun F, Sakhanenko NA, Galas DJ, Galitski T: A systems-biology approach to modular genetic complexity. Chaos 2010, 20: 026102. 10.1063/1.3455183
5. Ignac TM, Sakhanenko NA, Galas DJ: Relation between the set-complexity of a graph and its structure. In Proceedings of the Eighth International Workshop on Computational Systems Biology: 6–8 June 2011. Edited by: Koeppl H, Acimovic J, Kesseli J, Maki-Marttunen T, Larjo A, Yli-Harja O. Zurich, Switzerland: Tampere University of Technology, TICSP Series; 2011:81–84.
6. Cover TM, Thomas JA: Elements of Information Theory. New York: Wiley-Interscience; 1991.
7. Gross J, Yellen J: Graph Theory and its Applications. Boca Raton: CRC Press Inc; 1999.
8. Ignac TM, Sakhanenko NA, Galas DJ: Complexity of networks II: the set complexity of edge-colored graphs. Complexity 2012, 17: 23–36.
9. Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell 2002, 13: 1977–2000. 10.1091/mbc.02-02-0030
10. Maki-Marttunen T, Kesseli J, Kauffman S, Yli-Harja O, Nykter M: On the complexity of Boolean network state trajectories. In Proceedings of the Eighth International Workshop on Computational Systems Biology. Edited by: Koeppl H, Acimovic J, Kesseli J, Maki-Marttunen T, Larjo A, Yli-Harja O. Zurich, Switzerland, 6–8 June 2011: Tampere University of Technology, TICSP Series; 2011:137–140.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.