Statistical discovery of site inter-dependencies in sub-molecular hierarchical protein structuring
© Durston et al.; licensee Springer. 2012
Received: 2 November 2011
Accepted: 29 May 2012
Published: 13 July 2012
Skip to main content
© Durston et al.; licensee Springer. 2012
Received: 2 November 2011
Accepted: 29 May 2012
Published: 13 July 2012
Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there are indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families.
The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams which revealed four structural sub-domains within the single domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appear to play an important role in the molecular structure and/or function.
Our results demonstrate that the method we present here using a k- modes site clustering algorithm based on interdependency evaluation among sites obtained from a sequence alignment of homologous proteins can provide significant insights into the complex, hierarchical inter-residue structural relationships within the 3D structure of a protein family.
The determination of protein 3D structure using methods such as NMR and X-ray crystallography has made tremendous progress. Although the 3D structure of many proteins has been solved, there still remains the problem of understanding the internal relationships within the structure. Certain residues may require specific associations with other residues within the structure that are not necessarily spatially proximal. Certain pairwise, third-order, fourth-order, and higher-order associations may be essential for obtaining a stable structure, while other parts of the structure have a less important role. The challenge is to be able to identify key structural associations within the larger structure, with the objective of understanding what role they play within the larger structure or global function of the protein.
Granular computing is emerging as a computing paradigm of information processing based on the abstraction of information entities called information granules [1–3], which we define here as related entities that are abstracted from the protein family multiple sequence alignment data based on a possible shared function. Functional bioinformatics is a sub-discipline of bioinformatics that incorporates functionality into the analysis of biopolymers [4–6]. Within a multiple sequence alignment, each column represents an aligned site, where a site refers to a location within an amino acid sequence. For example, a set of aligned sequences for a protein family that, on average, consists of 150 amino acid positions (with inserted gaps), would have 150 sites in its sequence. It is useful for estimating the functionality characteristics of sequence segments and/or functional sites, which can add enormous insights into understanding the characteristics of biopolymers.
The importance of patterns of associated sites within protein sequences has long been recognized. Structural and functional characteristics of a protein family may often be dependent upon two or more sites that maintain the stability of the molecule [7, 8], such as in the situation of compensatory substitutions. In the late 1970s, based on statistics and information theory, Wong et al.  proposed a statistical analysis of site variability and interdependence in a protein family relating to the structural and functional relationships of sites in cytochrome c. Smith and Smith  developed a computer algorithm for detecting relationships among different sites in an amino acid sequence. Lichtarge et al. [11, 12] have developed an evolutionary trace method (ETM) to identify clusters of sites associated with function, by mapping sites with a high degree of conservation onto the surface of the solved structure. Liu and Califano  have suggested a method for the functional classification of proteins through pattern discovery. Further work extended to data from aligned sequences has been conducted by Wong et al. [9, 14, 15] and Chiu et al. [16–18] in the developing of pattern discovery and analysis. An important goal of our proposed work is to extend our understanding of sub-molecular, internal relationships within the 3D structure of proteins by analyzing their multiple sequence alignments. In this article, we introduce a powerful new form of analysis based on the concept of granular computing and the k-modes attribute clustering algorithm (k- modes algorithm for abbreviation) to reveal statistical associations among multiple amino acids, using the aligned sequence data of both the ubiquitin and transthyretin protein families as the test bed. We make several discoveries, including three types of multiple amino acid associations as well as the observation that some of them form nested hierarchical branches and modules within the larger structure, indicating that our proposed granular computing method is conceptually sound and renders new understanding of internal relationships within the 3D structure of globular proteins.
Consider a given alignment of multiple sequences, possibly representing the different members of a protein or gene family. A multiple sequence alignment is defined formally as follows.
Consider that a family of molecular sequences is properly aligned. Let the aligned sites be represented as X = (X1, X2, …, X N ) where N is the number of columns (aligned sites) in the multiple sequence alignment. Each aligned site is considered as an attribute. A realization of X is a particular sequence within the alignment and can be denoted as x i = (xi 1, xi 2,…, x iN ) where i is the row number within the alignment and x ij may assume any value in its alphabet set G. For proteins, G consists of the 20 amino acids and for DNA or RNA, G consists of the 4 nucleotides. We refer to an ensemble of outcomes of X as a Multiple Sequence Alignment
where Xn represents the column/aligned site/attribute number and x ij represents the particular amino acid or nucleotide found in row i and column j. For a protein family, the data for the two-dimensional array is the multiple sequence alignment for the family contained in databases such as Pfam [19, 20].
is the joint entropy between X i and X i' . Since I(Xi,X i' ) increases with the number of possible attribute values, R ii' must be normalized. This avoids biasing the search for associations among sites toward larger clusters, which may actually have a low level of internal interdependency. The function p refers to the estimated probability from the sample data.
One should note that the stronger the interdependency between X i and X i' , the higher the I(X i , X i' ) value. As indicated by Equation (2), from the statistics standpoint, the normalized R ii' also accounts for the amount of deviation from the independence hypothesis between X i and X i' . R ii' = 1 if X i and X i' are perfectly dependent and R ii' = 0 if they are completely independent. Since R ii' is a normalized interdependency measure, the interdependence relationship is not affected by the number of attribute values. If there is an interdependency between two attributes, there is a greater degree of correlation between them than when compared to two attributes that are less interdependent or independent. For this reason, R ii' is able to measure the interdependence or correlation between attributes [9, 15]. If R ii' > R ih , h ∈, h ≠ i ≠ i′, the dependency between X i and X i' is greater than that between X i and X h . In searching for higher-order relations between attributes in a sequence alignment, we use R ii' to measure the interdependence between attributes X i and X i' . On the basis of R ii' , a statistical test is introduced to test whether or not two attributes are interdependent (or are deviating from independence) . Recall that 0 ≤ R ii' ≤ 1, and R ii' = 0 if X i and X i' are totally independent, and R ii' = 1 if totally dependent. The test, in terms of R ii' , then becomes as follows.
Two attributes X i and X i' are dependent if R ii' ≥ c2(Gi–1)(Gi'–1)/2n H(X i , X i' ) and false otherwise, where c2(Gi–1)(Gi'–1) represents a chi-square distribution with (Gi–1)(Gi'–1) degrees of freedom (recall that for proteins G represents the alphabet set of 20 amino acids) and n represents the number of specimens in the cluster.
For sequences that presume to encode embedded functionality such as DNA, RNA, or proteins, I(X i , X i' ) can be considered to reflect the mutual functional information governed by known (or predicted) functionality of a pair of sites. A measure of functional complexity (FC) was proposed by Durston et al.  for the case where the sequences are specified by a defined input class. Here, however, we are measuring the FC of the relationships between aligned sites, as specified by the attribute cluster.
Within a cluster of attributes, we define the mode as the attribute with the highest normalized interdependency redundancy (SR) value. Formally, X i is the mode of cluster X if
SR(i) ≥ SR(i') for all attributes X i' in X.
For the purpose of this article, and to compare clusters of different numbers of attributes, the highest normalized interdependency redundancy (SR(i)) divided by the number of attribute pairs N* in the cluster (i.e., SR(i)/N*) will be designated as SR(mode). The SR(mode), therefore, in quantifying the interdependency between the mode and the other attributes within the cluster, provides one method to quantify the average degree of interdependency within the cluster, as well as an objective method to rank different attribute clusters in terms of their internal interdependency.
Often, within a sequence alignment, it may be that two or more sites are closely associated by their amino acid composition, possibly due to residue–residue contacts . It has been found that such amino acid associations could form compensatory relationships in certain mutational events . As a result, amino acids at one site could have a functional relationship with particular amino acids at the other sites within the associated sites. We define these amino acid associations as patterns if the association is statistically significant.
An event is an observation of an instance in a given data ensemble, either involving a single value or multiple jointly observed values. A single-valued event is referred to as a primary event and is a realization of X i that takes on a value from , where α represents a symbol from the alphabet set G. For a primary event, only one outcome of a variable in X is involved. A multiple valued event is referred to as a compound event. A pattern is a compound event (an observation from a subset of n different variables of the N-tuple X) which is statistically significant as reflected by its frequency of occurrences [23, 24]. A pattern is denoted as where (i, i', i'', …, n) is a set of aligned site indices (e.g., sites 2, 15, and 19), within the multiple sequence alignment. It represents the joint outcomes of , , …, and that such association is statistically significant in the sense that the frequency of their occurrences deviate significantly from the default expected frequency if they are just a random occurrence or if they are a completely independent set.
Attribute clusters with two high residual patterns
λ1 amino acids
λ2 amino acids
In traditional pattern recognition, clustering is a process that groups similar samples together, maximizing the intra-group similarity and inter-group dissimilarity based on certain similarity or distance measures. An example is the method proposed by Li and Scheraga  for grouping similar sequences on the basis of their information distance, the length of the smallest binary computer program that will convert two sequences to each other. Clustering algorithms that group according to similarity between attributes are not appropriate for clustering in terms of interdependency, since dissimilar attributes may be interdependent. (For example, two aligned sites may be composed of dissimilar amino acids, yet be interdependent.)
If we analyze the relationship among attributes with the objective of clustering them into groups, we consider their interdependence instead. Using a sample clustering analogy in clustering attributes, we should maximize the intra-group interdependence (correlation) and inter-group independence. Hence, the attribute clusters within the sequence alignment could be obtained by using attribute interdependence values for associations.
There is a subtle difference between attribute interdependence and statistically significant patterns. The former describes dependence between attributes such as sites within an amino acid sequence, while the latter relates an associated set of amino acids at a respective set of sites in a sequence as joint events based on their statistically significant association. (To clarify, taking the same example 11 in Table 1, we observe that three attributes are mutually dependent. Yet, within these three sites we observe two third-order events λ1 = NKE and λ2 = GDG which are statistically interdependent.) Hence, one third-order cluster of aligned sites can contain two or more third-order patterns of amino acids within those sites. We could anticipate that attribute clusters usually contain statistically significant patterns among the amino acids in the clusters. Before we define attribute clusters, we first introduce the algorithm by which attribute clusters could be found.
For sample clustering, a well-known process is the k medoids algorithm . To cluster samples into k groups, it first selects a random sample to represent each group and considers it as the center or medoid for that group. Thus, it selects k samples around which to build k groups. That is why the algorithm is referred to as the k-medoids algorithm. Using a distance measure, it clusters each of the remaining samples into the group for which the sample is closest to the medoid. After the first round, it updates a sample as the medoid of a group if the sum of distances of that sample to all other samples in that group is minimal. Based on their closest distance to these new medoids, all samples are then regrouped. The process is repeated until there is no more shift of samples between clusters. This method can work well for similarity-based clustering, but as pointed out earlier, interdependency might not be a function of similarity; two very dissimilar attributes may actually be interdependent. That is, different amino acids at distant sites could still be inter-associated if their occurrences are observed together.
The ETM, mentioned earlier, forms clusters of highly conserved sites that, when mapped to a 3D model, are correlated with a function such as binding. In this approach, an interdependency measure is not used. Instead, interdependency is inferred. Also, the interdependency that is inferred is between the function and the cluster of highly conserved sites. The sites (or attributes) may not necessarily all be interdependent between themselves. Furthermore, the ETM can be used only in proteins that have a known structure and are relatively free of noise.
The first step is initialization where k, the prescribed number of clusters, is inputted and an attribute is randomly selected for each cluster, representing a candidate for the mode of that cluster. Step 2 is to calculate the interdependency redundancy measure R ii' between each cluster mode and each of the remaining attributes. Each of these remaining attributes is then assigned to the cluster for which it has the highest R ii' value. Step 3 is to compute the updated mode for each of the k clusters (recalling that the initial mode was only a candidate used to start the process). In the subsequent iterative step, once the new mode of each of the k clusters has been calculated, steps 2 and 3 are repeated until the calculated mode for each of the k clusters does not change, that is, there is no exchange of attributes between attribute clusters. The algorithm is terminated at that point or, alternatively, at a pre-specified number of iterations.
For protein family multiple sequence alignments, all unassociated insertions and gaps are computationally removed from the previous procedure. Because it is generally recognized that sequence determines structure , and structure is important for function, it is assumed that each of the remaining sites within the sequence is either directly or indirectly interdependent with every other site in the sequence. To clarify, one pairwise interdependency may not be directly related to another attribute cluster consisting of three interdependent sites, but both clusters may be nested within larger clusters that are directly related to form a still larger cluster. An exception might be a cluster of sites that are interdependent only due to an external function, such as a highly conserved binding site. Thus, in order to discover the full range of the nested clusters, the k-modes algorithm is run for all values of k from k = N – 1 down to k = 2. For example, a multiple sequence alignment with 100 aligned sites would first be run with k set to N – 1 = 99. This would result in 1 pairwise cluster and 98 single attribute clusters to account for all 100 aligned sites. The value for k would be reduced by 1 each time all the way down to 2, which would yield 2 large attribute clusters comprising all 100 aligned sites. It is possible that the structure of a protein may not actually fit into two high-order attribute clusters. For example, a protein family that has a 3D structure composed of three structural domains may yield valid results for a k as small as 3 but not for a k as small as 2. If this is the case, it will become evident when the cluster tree of nested hierarchical attribute clusters is built, which will be explained shortly. Because the k-modes algorithm clustering method is data driven, it should yield the same results for multiple runs of the same k, even though a random attribute is chosen at the initial step. This was found to be the case for both protein families analyzed.
This order of computational complexity is manageable within a reasonable space and time on any recent laptop computer.
An attribute cluster is formally defined as follows.
For an ensemble of data X = (X1, X2, …, X N ) with N attributes, if X is partitioned vertically into k sub-arrays by the k-modes attribute clustering algorithm, then each sub-array is defined as an attribute cluster in the k-cluster configuration of X.
Representative clusters from each primary branch in ubiquitin
Highest pattern residual
Double bond between 7 and the other sites
11 is a binding site
12 is adjacent to binding site 11
Double bond between 18 and 21
Double bond between 23 and other sites
27 is a binding site
30 in Van der Waals contact with 41
29 is a binding site
Van der Waals contact 34 & 36
34 is adjacent to binding site 33
37 bonds to 40
43 is adjacent to possible external recognition site 42 and external contact site 44
Van der Waals contact 44 & 47
sites 44 and 47 are external contact points, 47 is adjacent to major binding site 48
site 49 is an external contact point
8 internal H-bonds
63 is a major binding site
site 70 is an external contact point
43 in tight Van der Waals contact with 69
74 is a possible external recognition site
70 is an external contact point, 73 is an external recognition site
In this study, we consider only those patterns observed within an attribute cluster even though some might span across attribute clusters with weak interdependence. The set of attributes (as random variables) within an attribute cluster is then defined as a new pattern subspace as below.
Consider a subset Y of N* (N ≥ N* > 1) attributes of X such that at least one pattern is spanning that subset. Let denote the set of m patterns contained in the subset Y. We refer the vector subspace Y as a significant pattern subspace containing the m patterns which span that subspace.
For example, in Table 2, cluster 11 forms a transformed attribute, consisting of the attribute set of aligned sites 60, 63, and 64. The attribute cluster 11 is not the pattern, rather, it is a result contributed by two amino acid patterns λ1 = NKE and λ2 = GDG at two different subsequent instances in the significant pattern subspace (second row in Table 1). Note that the variables in a subset of X forming a subset Y in the new significant pattern subspace will not be overlapped, as our attribute clustering algorithm will ensure that they are disjoint. Later, we will discuss how a hierarchical structure can be constructed in the form of a cluster tree.
The concept of granular computing allows analysis of granular units, in terms of modularity and hierarchy as a web structure of relationships [29–36]. Applying this concept to proteins entails analyzing them in terms of a hierarchical assembly of modules, each of which is associated with an attribute cluster.
Within a group composed of different types, a granule was originally conceptualized and served as a focal point of similar types within the group to study their collective properties . For our problem of understanding the internal association relationships of a protein, the clusters of sites in the alignment can be considered as a set of granular units (which we called information granules) such that their information reflects the inferred associative properties of the protein family under study. A low-order attribute cluster may represent two or three residues that are in contact through H-bonds or in van der Waals interaction distances. The two or three residues can also be more distant from each other, yet be mutually associated, possibly through the effects the members of the cluster have on torsion angles and free energy during folding. Low-order clusters may be nested within higher-order clusters, which represent important sub-domain structural components.
As already mentioned, each higher-order cluster should contain most of the sites included in the lower-order cluster(s) that are nested within it by choosing a reasonable number of clusters, or a k value that is not too small. The completed tree should be a valuable tool in identifying and understanding sub-molecular relationships, both functional and structural, within the generalized 3D structure of the protein family being investigated.
Each information granule is defined as an attribute cluster and allows for local analysis, as well as a global analysis under this framework. The hierarchical levels of analyses then provide multiple views into the protein structure including key residue–residue contacts, pairwise, third- and fourth-order relationships with multiple sites on up to larger sub-domains.
Table 2 lists 14 attribute clusters within the aligned sequences of the ubiquitin family. Each of these reflects complex variability and association information that can be analyzed and compared to the functional and structural characteristics necessary for the ubiquitin family. By studying these generated granules, detailed insights can be gained on the sub-molecular characteristics of the ubiquitin family.
To summarize our approach, Figure 1 shows a flow chart diagram for the attribute cluster analysis of protein families. The analysis begins with the downloading of a sequence alignment of all sequences predicted for a protein family. Redundant sequences are then removed so that the resulting alignment consists only of unique sequences for that family. The resulting alignment is then analyzed using our method to detect attribute clusters. The attribute clusters are then grouped into hierarchical families of clusters forming branches, some of which may form a hierarchical cluster tree, an example of which can be seen in Figure 2. The clusters are then analyzed for structural and functional contributions with the objective of better understanding the key internal hierarchical structural sub-domain relationships and at various levels.
Our approach is to locate statistically significant associations among sites in a multiple sequence alignment, as well as statistically significant patterns of amino acids within those associations. For locating associations between sites, the normalized mutual information R ii' (Equation 1) between attributes X i and X i' , known as redundancy, is used.
where νλ is the variance of δλ. To find an amino acid pattern within a given cluster, a lower cutoff value for the adjusted residual is chosen. The adjusted residuals are calculated based on the amino acids observed within the multiple sequence alignment within the particular cluster being examined (i.e., only one cluster is analyzed at a time, which reduces the computational search to just the aligned sites in the cluster). An adjusted residual that is less than the cutoff value is rejected and an adjusted residual higher than the cutoff value is retained as significant to a certain confidence level. (For example, an adjusted residual of 2.58 means that an amino acid pattern is statistically significant to a confidence level of 99%, and an adjusted residual of 3.29 is statistically significant to a confidence level of 99.9%.)
In this way, only statistically significant results are obtained and spurious results preempted. Thus, statistical significance is the operative criterion throughout the search process, ensuring statistically significant results. For a multiple sequence alignment, the minimum number of non-redundant aligned sequences required for statistically significant results is five . However, only very strong associations and patterns will be revealed with such a small sample size. Adding additional non-redundant sequences to the alignment will increase the likelihood of finding additional patterns within the same association of sites, and of locating the smaller associations among sites that may be nested within the stronger, larger associations. Thus, the larger the non-redundant multiple sequence alignment, the more attribute clustering detail can be resolved.
Overview of experimental results and predictions
Predictions for further research
Attribute clusters can be grouped into a hierarchical cluster tree composed of branches, nodes, and modules
Cluster trees may reveal details of folding constraints and other functionality relationships
Within modules, two types of branches were found: (i) Type I; independent/non-interlacing with other branches and (ii) Type II; interlacing with other branches
The relationship between Types I and II branches may indicate constraints in the folding, tertiary structure and functionality of protein molecules
Attribute clusters with highest SR(mode) values were most commonly third- and fourth-order (3 to 4 associated sites)
Support for the next largest building block in proteins above single amino acids is typically a 3 to 4 amino acid structural unit
Three types of attribute clusters were found: (i) H-bonded clusters, (ii) Van der Waals clusters, and (iii) extended clusters
Identifying attribute clusters may support possibility of predicting protein tertiary structural relationships involving H-bonds, van der Waals interactions, or multiple site effects
Representative clusters found in ubiquitin confirm that the statistical criterion used can identify structural constraints such as bonding, binding and recognition sites (Table 2)
Can be used to locate key sub-molecular components of a protein, as well as components critical for the function of that protein
Ubiquitin molecule found to be composed of four major modules
Consistent with the four major areas of chemical shift perturbations between Ub1 and Ub2
A cluster tree was assembled as described in the Section “Attribute clusters and information granules”, and is shown in Figure 2. Two types of branches can be observed. A Type I branch is made up of aligned site clusters which exhibit little or no interlacing with clusters residing in other branches (for example, primary branch (1) in Figure 2). The lack of interlacing between a Type 1 branch and other branches may indicate that the residues within a Type I branch are assembled completely prior to folding into the larger structure. The Type II branch, as illustrated by branches (5), (6), and (7), is made up of attribute clusters containing sites that interlace with sites in other branches. The overlapping/interlacing sites in the attribute clusters among Type II branches suggest that there is a strong structural relationship among those branches, giving robust structural stability in the region containing them. A prediction that arises out of the observation of two neighboring, strongly interlaced Type II branches, is that during translation, the two branches may fold into each other before folding as a cohesive unit back onto the part of the protein that has already been formed by earlier branches.
Figure 2 reveals that the 14 primary branches converged into 4 primary modules (labeled in Figure 2 as modules 1–4). Modules formed as the result of merging two or more primary modules are labeled as modules A and B (Figure 2). For example, modules 1 and 2 converged to form module A. Since branches 8, 9, and 10 were strongly interlaced, and showed no signs of converging with branches 11–14, branches 8–10 were grouped to form their own module 3. The strong interlacing of branches 11–14 resulted in the formation of module 4. Once the cluster tree was completed, the next steps were to analyze a representative cluster from each branch, and then analyze the four modules. The next three sections contain the results of these steps.
For the next step, we chose a set of representative clusters for detailed examination. Any cluster can be chosen to represent a primary branch, but for ubiquitin, the cluster with the highest SR(mode) value was chosen to be the representative cluster. The representative cluster for each branch is shown highlighted in red in Figure 2 and labeled 1 to 14, where the numeral denotes the primary branch number. Table 2 contains a list of the representative clusters for the primary branches 1–14 together with the highest adjusted residue value of the strongest pattern within each representative cluster.
Using the secondary structure revealed in the X-ray diffraction 1.8 angstrom solved structure 1UBQ [37, 38] available from the Protein Data Bank, of the representative attribute clusters in each branch, we observe that (a) 11 out of 14 clusters included loops or turns; (b) 9 out of 14 included beta strands; and (c) 5 out of 14 included helices. We also note that three of the clusters occurred entirely within loops or turns, with only one cluster occurring entirely within a helix and only one entirely within a beta strand. Since loops and turns can be highly constrained structurally, it may explain why a higher occurrence of clusters included loops and turns.
Another observation from the representative clusters (highlighted in red in Figure 2) is the percentage of fourth-order (involving four sites) clusters. Of the set of clusters, one from each primary branch, that had the highest SR(mode) value (i.e., the strongest average interdependency between sites), 7 were fourth-order (50%), 5 were third-order (36%), sixth-order clusters comprised only 14%, and there were no fifth-order high SR(mode) value clusters in the individual branches. This suggests that the most common structural units within proteins may involve 3 to 4 amino acids. The node for Module 1 composed of the 21st-order cluster (6–25, 27) had a high SR(mode) value of 0.202. This high internal interdependency may be a strong predictor of a structural sub-domain. A high SR(mode) value at the node of a module indicates the level of internal interdependency and stability once the folding is complete for that module.
The k-modes algorithm found one or more patterns for each of the representative clusters. Recall that a pattern consists of an association of specific amino acids among the sites in the cluster. For statistical significance, an adjusted residual value was calculated for each pattern in a cluster. The residual value for each of the representative clusters shown in Table 2 is only for the pattern with the highest residual found within each representative cluster. Many of the clusters had other patterns as well. For example, two patterns for each of three different representative clusters are shown in Table 1. The fact that patterns were discovered for each of the representative clusters is verification that the attributes clusters discovered by this method contain high residual associations of amino acids.
Sites 8, 44, and 70 form hydrophobic patches that appear to contact the corresponding sites in Ub2, stabilizing the dimer in its closed conformation .
Sites 47, 49, and 71 form hydrogen bonds with corresponding sites in the other moiety in Lys-48 linked Ub2.
Sites 42, 72, and 74 are located at the interface between the two ubiquitin domains, which may represent possible interaction sites that distinguish multi-ubiquitin chains from a single ubiquitin molecule, providing a means of external recognition .
Using these comparisons, the representative clusters (1–14) in each branch were then examined. All of the major clusters were found to be either functionally or structurally significant (Table 2). Some clusters in particular were found to have an important internal structural relationship and are discussed in detail below. Even though the k-modes algorithm is blind to the position of a site in the sequence, most of the discovered clusters contain sites that are either consecutive, or very close in sequence proximity, which one could expect to be structurally associated. This result suggests that the method presented here is able to pick up local structural associations remarkably well. Furthermore, there are some cases where the associated sites of the clusters are located at a distance (e.g., sites 30, 43, 68, and 69) with binding site relationships, or other structural or functional relationships (Table 2).
In addition to the Types I and II branches describing how the sites are interlaced, at least three types of attribute clusters were also observed which we designate: H-bonded clusters, van der Waals clusters, and extended clusters. H-bonded clusters contain residues that are in mutual H-bond contact. An example of this type of cluster is (1) shown in Figure 3B, forming a loop stabilized by no less than six predicted (PyMol) internal H-bonds. The second type of cluster, the van der Waals cluster, has sites that are not bonded but are in van der Waals contact. An example of this type is cluster (12 s), modeled in Figure 3C. The 1UBQ model reveals that the last cluster in the primary branch (12) (although quite dispersed along the primary sequence) contains van der Waals interactions, as well as predicted H-bonds, with sites 43 and 44 in van der Waals contact with sites 68 and 69, and two additional hydrogen bonds predicted by PyMol between site 44 and site 68. Hence, the clusters help to identify sites that are closely related in the tertiary structure, and often in actual contact, though widely separated in the primary sequence.
These first two types of clusters predicted by the pattern recognition software used in this research can readily be corroborated by mapping the attribute clusters to a solved 3D structure for the protein. The third type, however, designated as the extended cluster, is composed of sites that are distant both in the primary sequence, as well as the final 3D structure, yet they have a very high SR(mode) value, indicating strong association. When mapped onto the solved 3D structure of the protein, it is less obvious why the constituent sites form an attribute cluster. There are two possible reasons why some clusters may consist of sites that are not within van der Waals interaction distance from each other in the 3D structure, yet demonstrate a strong interdependency. First, it is possible that for extended clusters, the interdependence is due to the net effect of the conformation angles ψ and ϕ and the conformations of the side chains for the amino acids within the cluster  upon the short and medium range structure and free energy state of the protein in the area containing the extended attribute cluster. This net effect may be required to minimize free energy and directly maintain the stable tertiary structure. Second, the association of sites that are not structurally proximate to each other may be due to external functional constraints such as external binding as first conjectured by Wong et al. . From Table 2, it can be observed that some clusters with no obvious internal structural relationship, such as 9, 11, 12, and 14, all contain external binding sites or, where the interaction is not yet known, external contact points with the external molecule it binds to . Thus, the mutual interdependency of the sites in an extended cluster may be due to external binding requirements of functional controls where a number of sites are involved.
The success of this method in identifying important structural associations within ubiquitin does not mean that it will identify all such bonds, especially when the association among sites is weak. It does, however, demonstrate the ability to identify some key residue–residue contacts from the alignment data. This relationship among statistical association, molecular structure, and functionality is also consistent with earlier findings that identified the relationship among statistical patterns and 3D bonding structure  and functionality such as heredity factors in cancer suppressor genes [16, 17]. The current results bring to light further details of statistical patterns that are strongly indicative of molecular structure and functionality in an alignment of proteins. Further work needs to be done to enable distinctions among clusters that predict residue-residue contacts, extended clusters that are keys in achieving structural stability, and those associated with external binding sites. One possible approach which has not been explored is to compare the FC of each site, which may be more heavily influenced by external binding, with the site clusters . Another approach would be to compare the results of steered molecular dynamic computational modeling [44, 45] on the more interesting parts of the cluster tree to see how those results would correlate with the cluster tree.
Module and perturbation area comparison
Sites spanned in module
Sites spanned in perturbation areas of Ub2
The tertiary structure of Module 1 is shown in Figure 3B. The cluster with the highest SR(mode) value, discovered in our experiment, was cluster (1), with a SR(mode) value of .26. Cluster (1) is also labeled in Figure 2 and consists of sites (7, 9–11). The high SR(mode) value shows that the sites are highly associated. From the 1UBQ model in Figure 3B, no less than six H-bonds are predicted by the MacPyMol model between Thr-7 and the other members of the cluster, sites 9–11. The number of internal H-bonds not only explains why it has such a high SR(mode) value, but it also indicates that site 7 is especially important within the cluster. Furthermore, the presence of the six predicted bonds in the cluster is an indication that cluster (1) provides structural stability to the loop and as a result, may have an important role in folding the two beta strands into a beta sheet. Site 11 within cluster (1) is also a known binding site  (Table 2).
Cluster (3) also has one of the higher SR(mode) values in the alignment, with a value of 0.19. From Figure 3B, it can be observed that this cluster, containing sites (18–19, 21) is part of the next loop after the beta sheet. In this particular cluster, there are two predicted bonds between Asp 21 and Glu 18, providing verification for the significance of the computationally discovered association. The double bond has the effect of providing structural rigidity to the loop.
From Figure 2, the core of Module 2 is composed of three interlacing Type II branches 5 to 7, made up of attribute clusters from sites 29 to 43. An examination of the clusters for each branch reveals that the sites in each of the three clusters straddle sites in the other two. For this reason, these three branches appear to form a very stable, interlocking structural unit. The high internal structural stability predicted by the Type II branches may explain why, in earlier work by Varadan et al. [40, 41], the area spanned by module 2 was observed to demonstrate significantly lower chemical shift perturbations than those observed in the areas spanned by the other three modules. It is often the case that sites within a cluster, though separated in the primary structure of sequence, may have members that are close once folding has occurred and may actually be bonded to each other. Sites 37 and 40 in cluster (7) are an example of this common occurrence. The 1UBQ model also provides a clear interpretation for cluster (5), by revealing the van der Waals relationship within the cluster between sites 30 and 41.
The unique feature of this module stands out in the cluster tree (Figure 2) as it reveals its relative independence from the other modules. This leads to a prediction that the branches in module 3 may fold into a stable sub-domain first, and then the entire sub-domain folds into the already folded module A to form module B. It is also notable that module 3 has a very close correspondence to the previously discovered third perturbation area shown in Table 4[40, 41].
Recent work involving ubiquitin unfolding kinetics using hydrogen/deuterium exchange methods correlates with the results found here [46, 47]. For example, the B5 strand, represented by module 4, appears to unfold before the loop and helix represented by primary branch 10 in module 3. This echoes what the cluster tree reveals regarding the independence of module 4. The cluster tree method demonstrated here may work very well in conjunction with unfolding studies in revealing details of protein folding.
Transthyretin (TTR) is a homotetramer composed of a dimer of dimers. The tetramer contains a solvent channel within which is a binding pocket for two thyroxine molecules. Each dimer interface includes sites 114 to 120, forming an 8-strand, anti-parallel β-sheet, which constitutes the wall of the solvent channel. TTR is often found in complex with Retinol Binding Protein (RBP). At the center of this interface for human TTR are sites 20, 81, and 84 . There are some TTR variants that result in amyloid formation. Two of the most common forms are Val30Met and Leu55Pro variants, both of which produce amyloidosis .
The overall cluster tree for TTR is remarkably different from the ubiquitin cluster tree in that there are large branches in TTR that do not converge into the rest of the tree. This may suggest that TTR does not have the internal stability that ubiquitin has, and may be more prone to misfolding. Further work on other proteins that are prone to misfolding would be warranted here, to see if non-converging cluster trees are characteristic of potential instability. Those results would need to be compared with the cluster trees for strongly stable proteins. An overview of our results and predictions is provided in Table 3.
We have introduced here a powerful new approach for analyzing the multiple sequence alignment data of protein families and discovering key associations among aligned sites and their importance within the 3D structure. Using two proteins of known structure and function as a test bench, our method revealed key associations among residues and sites that appear to have important structural and functional significance. It can, therefore, be applied to protein families of unknown structure and function. From our work, statistically analyzing the multiple aligned sequences for a protein family through the pattern discovery method presented in this paper revealed many key residue–residue contacts, as well as the sub-domain structure of ubiquitin and TTR. When the discovered attribute clusters were arranged according to their order, cluster trees were constructed which rendered further insights (dependent upon the hierarchy) into how the protein may fold. With our site clustering results, the secondary structure of ubiquitin forms four modules, which are closely associated with the four regions of perturbation in Lys-48 and Lys-63 linked Ub2 previously reported. Two categories of cluster branches, Types I and Type II, as we proposed, were discovered that may render a more a detailed understanding of how the protein folds. Furthermore, we observe that three types of attribute clusters were identified by our method, H-bonded, van der Waals, and extended clusters associated with binding sites. Such observations give additional insights into which associations of aligned sites make key contributions to a protein’s structural stability. Using the binding sites in the 1UBQ model and previous discoveries in the literature, we validate that the discovered clusters have both structural and functional significance. The TTR cluster diagram revealed further secrets of its 3D structure. Six short, non-converging branches were found, all of which contain clusters that have important structural or functional significance. Only one multi-branch sub-domain module was found for TTR, associated with the interface between the two monomers. The number of non-converging branches into large modules, however, suggests that TTR may be prone to instability when folding.
The method presented here, backed by previous work with ubiquitin and TTR, suggests that granular computing as a concept can make an important new framework for revealing the relationship between low-order (three or four residues) residue–residue contacts and the demarcation of higher-order sub-domains, using a cluster tree. This sub-molecular hierarchical view also identifies sites within a protein that may be of particular structural or functional importance in the design of new drugs, for example. The ability to discover key residue–residue contacts, branches, and larger structural sub-domains within a protein through the k- modes analysis of the multiple sequence alignment will be a significant asset in understanding the details in the sequence of protein folding, structure, and functionality among different residue locations within a hierarchical global protein framework. Furthermore, by discovering the important attribute clusters within a protein, predictions can also be made as to which mutations could be more harmful or more stable than others. All these play an important role in furthering our understanding of the information processing capability of genes and proteins, in terms of the specific use of functional units at specific locations on the sequence to create the 3D structure as well as the internal and external functionality of the molecules.
The ubiquitin protein family was chosen as a suitable workbench to test our method for three reasons. First, the number of samples is reasonably large, numbering a few thousand, permitting a meaningful statistical analysis. Second, the structure of ubiquitin is well known and a 3D model is available from the Protein Database . The available 1UBQ 3D model permitted us to compare the results and analyze their implication in terms of their structure and folding. Finally, there is considerable knowledge of the functions of ubiquitin, the most well known of which is the labeling of proteins for destruction within the cells of eukaryotic life [39, 55].
In our experiment, an aligned set of 2,442 sequences for ubiquitin was downloaded from the Pfam database . The data were then computationally post processed to retain columns less than 20% gapped. After the insertions and duplicate sequences were removed, the number of unique sequences remaining was 1,066 and the number of aligned sites remaining was 69. The alignment did not include the first 5 sites that were missing from the Pfam alignment at the time we performed our analyses. These 69 sites represented sites 6–74 of the ubiquitin primary structure using the 1UBQ 3Dmodel from the protein data base .
At the onset of our experiment, the pre-processed data, consisting of 1,066 aligned, unique sequences was then analyzed using the k-modes algorithm for discovering clusters of attributes (sites) within the primary structure of ubiquitin. The data were also analyzed to compute the SR(mode) value for each cluster.
The first step in analyzing the results was to manually arrange the clusters into a cluster tree as described earlier in the Section “Attribute clusters and information granules”. Branches containing the lowest-order clusters are denoted as primary branches and are numbered from 1 to 14 across the top of Figure 2. Within a nested hierarchical set of clusters, each cluster will be informative to examine; there is no official representative cluster. One way of choosing a representative cluster is to pick out clusters that have higher mutual interdependency than others, indicated by a high SR(mode) for the cluster. For the ubiquitin family only, this method was chosen, although any other cluster could have been picked as a representative cluster for a particular branch. The first step was to compute the SR(mode) value for each cluster in the primary branch. The cluster in each branch with the highest SR(mode) value was chosen as the representative cluster, since it had the highest mutual interdependency. If two clusters in the same branch had similar values for SR(mode) the cluster with the higher order was selected, since the higher-order cluster would contain more sites and, thus, be more informative and interesting. For the same reason, if a higher-order cluster in the branch had a SR(mode) value that was only slightly lower than the highest SR(mode) value of another, lower-order cluster, the higher-order cluster instead was chosen as the representative cluster for that primary branch. For consistency, a certain percentage difference can be chosen as the cutoff. These representative clusters, each composed of two or more sites with a high interdependence as indicated by the SR(mode) value, were then computationally analyzed to determine what patterns (associations of specific amino acids within the sites spanned by the attribute cluster) were contained within each representative cluster. The residual of each of the patterns was also computed. The discovered attribute clusters of ubiquitin were examined in terms of their relevance to the 3D structure using the 1UBQ model. They were also analyzed for function as described in the literature. Statistically significant patterns were computationally discovered by searching for amino acid combinations that had an adjusted residual value greater than 3.29, which corresponds to a confidence level greater than 99.9%. A multiple sequence alignment for TTR, downloaded from Pfam , was also analyzed using the same procedure outlined above for ubiquitin with the exception as to how the representative clusters were chosen. Since all clusters are statistically significant, any cluster can be chosen as a sample cluster. For TTR, we chose the third-order clusters as the sample clusters. Since the criterion was different for TTR, they were referred to as sample clusters rather than representative clusters. Ideally, if time and resources permit, all clusters should be examined to more fully understand their significance.
Evolutionary trace method
The set of 20 normally occurring amino acids
Number of sequences in a multiple sequence alignment
Number of sites in an individual protein or multiple sequence alignment
Nuclear magnetic resonance
Retinol binding protein
Normalized redundancy measure
Normalized interdependency redundancy
Represents a cluster of mutually associated sites
Represents a particular amino acid
Represents a pattern of amino acids.
The authors are grateful to the Natural Science and Engineering Research Council of Canada (NSERC) Discovery Grants; the Advanced Food and Materials Network (AFMnet), the National Centre of Excellence of Canada; and the National Natural Science Foundation Grant of China.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.