A Time-Series-Based Feature Extraction Approach for Prediction of Protein Structural Class
© Ravi Gupta et al. 2008
Received: 28 May 2007
Accepted: 10 March 2008
Published: 26 March 2008
This paper presents a novel feature vector based on physicochemical properties of amino acids for the prediction of protein structural classes. The proposed method is divided into three stages. First, a discrete time series representation of protein sequences is obtained using physicochemical scales. Next, a wavelet-based time-series technique is used to extract features from the mapped amino acid sequence, and a fixed-length feature vector for classification is constructed. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.
Determination of protein structure from its primary sequence is an active area of research in bioinformatics. Knowledge of protein structures plays an important role in understanding their functions. Understanding the rules relating the amino acid sequence to the three-dimensional structure of the protein is one of the major goals of contemporary molecular biology. However, despite more than three decades of both experimental and theoretical effort, prediction of protein structure remains one of the most difficult problems.
The concept of protein structural classes was originally introduced by Levitt and Chothia based on a visual inspection of polypeptide chain topologies in a dataset of 31 globular proteins. A protein (domain) is usually classified into one of the following four structural classes: all-α, all-β, α/β, and α+β. Structural class categorizes proteins into groups that share similarities in their local folding patterns. The all-α and all-β classes represent structures that consist mainly of α-helices and β-strands, respectively. The α/β and α+β classes contain both α-helices and β-sheets, where the α/β class includes mainly parallel α-helices and β-strands and the α+β class includes those in which α-helices and β-strands are largely segregated. Prediction of structural classes relies on identifying these folding patterns from thousands of already categorized proteins and applying them to proteins whose amino acid sequences are known but whose structures are not. Structural Classification of Proteins (SCOP) is one of the most accurate classifications of protein structural classes and has been constructed by visual inspection and comparison of structures by experts.
In the past two decades, several computational techniques for prediction of protein structural classes have been proposed. Prediction is usually a two-step process. In the first step, a fixed-length feature vector is formed from protein sequences, which are of different lengths. The second step involves a classification algorithm. Klein and Delisi proposed a method for predicting protein structural classes from the amino acid sequence. Later, Klein presented a discriminant-analysis-based technique for this problem. In 1992, Zhou et al. proposed a weighting method to predict protein structural class from amino acid composition. A maximum component coefficient method was proposed by Zhang and Chou. A neural-network-based approach for protein structural classes was also developed using six hydrophobic amino acid patterns together with amino acid composition. A new algorithm that takes into account the coupling effect among different amino acid components of a protein by means of a covariance matrix was proposed by Chou. Chou and Zhang introduced the Mahalanobis distance to reflect the coupling effect among different amino acid components, improving the prediction accuracy for this problem. A support vector machine (SVM) method using amino acid composition features for prediction of protein structural class was presented by Cai et al. in 2001 and is one of the most accurate methods for classification. A supervised fuzzy clustering approach based on amino acid composition features was introduced by Shen et al. A combined approach, LogitBoost, was proposed by Feng et al.; it combines many weak classifiers to build a stronger classifier. In 2006, Cao et al. proposed a rough set algorithm based on amino acid composition and 8 physicochemical properties.
In this paper, a three-step procedure is proposed for prediction of protein structural class. The main contribution of this paper is a novel feature vector obtained by applying a wavelet-based time-series analysis approach. The proposed feature extraction from protein sequences is inspired by the work of Vannucci and Lio on transmembrane proteins. The fixed-length feature vector proposed for classification is derived from ten physicochemical properties of protein sequences. The physicochemical properties are used to convert the protein sequences from the symbolic domain to the numeric domain and to derive a time series representation for protein sequences. Features are extracted by applying a wavelet-based analysis technique for time series data to the mapped protein sequences. The feature vector summarizes the variation of physicochemical properties in the protein sequence. Finally, a support vector machine is trained using the novel feature vector, and its parameters are optimized to obtain the model with the highest prediction accuracy.
Leave-one-out cross-validation, also called the jackknife test, was performed on the datasets that were constructed by Zhou from SCOP. The datasets were also used by Cai et al. and Cao et al. for their experiments. Using the proposed approach, overall accuracies of 82.97% and 93.94% were achieved for the 277-domain and 498-domain datasets, respectively.
The paper is organized as follows. In Section 2, we describe the steps followed for extracting wavelet variance features from protein sequences. A brief introduction to the support vector machine (SVM) is also provided in this section. Section 3 provides the experimental results obtained for datasets of structural protein sequences. Conclusions follow in Section 4.
The proposed approach for identification of structural classes of proteins is divided into three stages: amino acid mapping, feature extraction, and classification. In the first stage, the protein sequences are mapped to various physicochemical scales provided in the literature. After this mapping procedure, the protein sequences become discrete time series data. The second stage involves construction of a fixed-length feature vector for classification. The feature vector is generated by combining wavelet variance features extracted from the different physicochemical scales used in the mapping stage. Finally, an SVM-based classification is performed on the extracted features to identify the structural class of a protein sequence.
2.1. Amino Acid Mapping
In this stage, ten different physicochemical amino acid properties were used. The first is the average flexibility indices provided by Bhaskaran and Ponnuswamy. The second is the normalized hydrophobicity scales provided by Cid et al. The third is the transfer free energy given by M. Charton and B. I. Charton and cited by Simon. The fourth is the residue accessible surface area in folded protein provided by Chothia. The fifth is the relative mutability, obtained by multiplying the number of observed mutations by the frequency of occurrence of the individual amino acids, provided by Dayhoff et al. The sixth is the isoelectric point provided by Zimmerman et al. The seventh is the polarity of amino acids provided by Grantham. The eighth is the volume of amino acids provided by Fauchere et al. The ninth is the composition of the amino acids provided by Grantham. The tenth is the molecular weight of the amino acids given by Fasman. The numerical indices representing the physicochemical properties of amino acids were downloaded from http://www.genome.jp/dbget.
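The mapping stage can be illustrated with a minimal sketch. It assumes a scale is stored as a dictionary keyed by one-letter residue codes; the name `map_sequence` and the error handling are illustrative choices, not taken from the paper. The molecular weights are rounded standard values for the free amino acids and should be checked against the database entry before any real use.

```python
# Molecular weights (g/mol) of the 20 free amino acids, rounded standard values.
# Assumption: these stand in for the molecular-weight scale; verify against the
# downloaded index before relying on them.
MOLECULAR_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.15,
    "Q": 146.15, "E": 147.13, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}


def map_sequence(sequence, scale):
    """Return the numeric series obtained by looking up each residue in `scale`."""
    series = []
    for position, residue in enumerate(sequence.upper()):
        if residue not in scale:
            # Non-standard residues (e.g. X) have no value in a 20-letter scale.
            raise ValueError(
                f"no scale value for residue {residue!r} at position {position}"
            )
        series.append(scale[residue])
    return series


if __name__ == "__main__":
    # Hypothetical three-residue fragment, used only to show the call.
    print(map_sequence("MKV", MOLECULAR_WEIGHT))
```

Applying the same function with each of the ten scales yields ten numeric series of equal length for one protein sequence.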
2.2. Feature Construction
The representation of a protein sequence by a fixed-length feature vector is one of the primary tasks of any protein classification technique. In this section, we present a wavelet-based time-series approach for constructing the feature vector. The wavelet transform is a technique that decomposes a signal into several groups (vectors) of coefficients. Different coefficient vectors contain information about characteristics of the sequence at different scales. The proposed feature vector contains information about the variability of ten physicochemical properties of protein sequences over different scales. The variability of physicochemical properties is represented in terms of wavelet variance.
In the present work, a variation of the orthonormal discrete wavelet transform (DWT), called the maximal overlap DWT (MODWT), is applied for feature extraction. In the past, MODWT has been applied to the analysis of atmospheric data and economic time series data. The MODWT is a highly redundant and nonorthogonal transform. The MODWT was selected over the DWT because it can handle any sample size N, while the Jth-order DWT restricts the sample size to a multiple of $2^J$. This property is very useful for the analysis of protein sequences, as the length of a sequence is generally not a multiple of $2^J$. In addition, the MODWT yields an estimator of the variance of the wavelet coefficients that is statistically more efficient than the corresponding estimator based on the DWT.
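The following minimal sketch shows how a MODWT-based wavelet variance can be computed for a series of arbitrary length. It makes several assumptions that are not taken from the paper: it uses the Haar filter (the simplest member of the Daubechies family) rather than whichever Daubechies wavelet the authors chose, circular boundary handling, and the simple estimator that averages all squared coefficients at each level. The function names are illustrative.

```python
def haar_modwt(series, levels):
    """Haar MODWT via the pyramid algorithm with circular boundary handling.

    Returns (details, smooth): `details[j - 1]` holds the level-j wavelet
    coefficients and `smooth` the final scaling coefficients. Every vector
    has the same length as `series`, whatever that length is.
    """
    n = len(series)
    smooth = list(series)
    details = []
    for j in range(1, levels + 1):
        lag = 2 ** (j - 1)  # the filter is stretched by inserting zeros
        prev = smooth
        # Rescaled Haar filters: wavelet (1/2, -1/2), scaling (1/2, 1/2).
        details.append([(prev[t] - prev[(t - lag) % n]) / 2.0 for t in range(n)])
        smooth = [(prev[t] + prev[(t - lag) % n]) / 2.0 for t in range(n)]
    return details, smooth


def wavelet_variances(series, levels):
    """Return one variance estimate per level: the mean squared coefficient."""
    details, _ = haar_modwt(series, levels)
    n = len(series)
    return [sum(w * w for w in level) / n for level in details]


if __name__ == "__main__":
    # Hypothetical series of length 7, which is not a multiple of any 2**J.
    print(wavelet_variances([2.0, 4.0, 1.0, 7.0, 3.0, 5.0, 6.0], levels=2))
```

Because the transform preserves energy, the squared coefficients at all levels plus the squared final scaling coefficients sum to the squared values of the input series, which gives a convenient sanity check for an implementation.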
The physicochemical variation of a protein sequence is summarized in the proposed feature vector. The dimension of is equal to and is dependent on the number of levels (J) to which the time series data (i.e., protein sequence) has to be decomposed. The value of J is further dependent on the length of the time series data (i.e., protein sequence length) and , where N is the number of observation points in the time series or the length of the protein. As most of the protein sequences taken up for the experiment have length greater than 32, we have selected . In this study, the Daubechies wavelet has been used for analysis.
The SVM was proposed by Cortes and Vapnik as a very effective technique for pattern classification. The SVM is based on the principle of structural risk minimization (SRM), which bounds the generalization error by the sum of the training set error and a term depending on the Vapnik-Chervonenkis dimension of the learning machine. The SVM induction principle minimizes an upper bound on the error rate of a learning machine on test data (i.e., the generalization error), rather than minimizing the training error itself, as is done in empirical risk minimization. This helps SVMs generalize well to unseen data.
An open-source SVM implementation called LIBSVM was used for classification. It provides various kernel types: radial basis function (RBF), linear, polynomial, and sigmoid. Experiments were conducted using different kernels; however, the RBF kernel was selected because of its superior performance in the current work. Further, for finding the optimum values of the parameters of the RBF kernel, LIBSVM provides an automatic grid search technique using cross-validation. Various pairs of (C, γ) are tried, and the pair that provides the best cross-validation accuracy is selected.
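The grid search idea can be sketched independently of any SVM library. In the sketch below, `cv_accuracy` is a hypothetical callable that would train an RBF-kernel SVM with the given (C, γ) and return its cross-validation accuracy; it is replaced here by a toy scoring function so that the example runs on its own. The exponentially spaced grid is an assumption modeled on the ranges commonly suggested for LIBSVM, not a setting reported in the paper.

```python
import math


def exponent_grid(start, stop, step):
    """Return [2**start, 2**(start + step), ...] up to and including 2**stop."""
    values = []
    exponent = start
    while (step > 0 and exponent <= stop) or (step < 0 and exponent >= stop):
        values.append(2.0 ** exponent)
        exponent += step
    return values


def grid_search(cv_accuracy, c_values, gamma_values):
    """Try every (C, gamma) pair; return (best_accuracy, best_C, best_gamma)."""
    best = None
    for c in c_values:
        for gamma in gamma_values:
            accuracy = cv_accuracy(c, gamma)
            # Keep the first pair that reaches the highest score seen so far.
            if best is None or accuracy > best[0]:
                best = (accuracy, c, gamma)
    return best


def toy_score(c, gamma):
    """Stand-in for a real cross-validation run; peaks at C = 2, gamma = 0.5."""
    return -((math.log2(c) - 1) ** 2) - (math.log2(gamma) + 1) ** 2


if __name__ == "__main__":
    c_grid = exponent_grid(-5, 15, 2)      # assumed range for C
    gamma_grid = exponent_grid(-15, 3, 2)  # assumed range for gamma
    print(grid_search(toy_score, c_grid, gamma_grid))
```

Since every grid point requires a full cross-validation run, a coarse grid of this kind is usually refined around the best pair rather than made uniformly finer.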
3. Experimental Results
Dataset for the current study.
The performance of the SVM classifier is measured using the leave-one-out cross-validation (LOOCV) technique. LOOCV is n-fold cross-validation, where "n" is the number of instances in the dataset. Each instance in turn is left out, and the learning method is trained on all the remaining instances. It is then judged by its correctness on the left-out instance: one for success and zero for failure. The results of all "n" judgments, one for each member of the dataset, are averaged, and that average represents the final accuracy estimate (equivalently, one minus the error estimate).
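The LOOCV procedure described above can be sketched as a short loop. The `train` and `predict` callables are hypothetical placeholders for any classifier; a one-nearest-neighbour rule is used here purely as a stand-in for the SVM so that the example needs no external library, and the data points are invented for illustration.

```python
def leave_one_out_accuracy(samples, labels, train, predict):
    """Hold out each instance once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        # One for success, zero for failure on the held-out instance.
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_1nn(xs, ys):
    """Stand-in learner: a 1-nearest-neighbour 'model' is just the stored data."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Return the label of the stored sample closest to x (squared distance)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda pair: sq_dist(pair[0], x))[1]


if __name__ == "__main__":
    # Hypothetical one-dimensional feature vectors for two classes.
    samples = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(leave_one_out_accuracy(samples, labels, train_1nn, predict_1nn))
```

The cost is one training run per instance, which is affordable for datasets of a few hundred domains but grows quickly with dataset size.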
Experimental result of one-versus-others test on dataset1 evaluated using LOOCV. Columns: true positive (TP), false negative (FN), true negative (TN), false positive (FP), area under curve (AUC), and optimal SVM parameters.
Experimental result of one-versus-others test on dataset2 evaluated using LOOCV. Columns: true positive (TP), false negative (FN), true negative (TN), false positive (FP), area under curve (AUC), and optimal SVM parameters.
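The four counts reported for each one-versus-others test can be turned into summary measures. The sketch below computes sensitivity, specificity, accuracy, and the Matthews correlation coefficient from TP, FN, TN, and FP; the choice of these particular measures and the zero-denominator convention are assumptions, and the counts in the usage example are hypothetical rather than values from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Summarize one-versus-others counts as common performance measures."""
    def ratio(numerator, denominator):
        # Convention assumed here: report 0.0 when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Hypothetical counts for one class against the other three.
    for name, value in binary_metrics(tp=50, fn=10, tn=200, fp=17).items():
        print(f"{name}: {value:.3f}")
```

In a one-versus-others setting the negative class is typically much larger than the positive class, so accuracy alone can look high even when sensitivity is poor; reporting several measures side by side guards against that.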
Comparison of leave-one-out cross-validation accuracy obtained for the protein structural classification problem on the two datasets by our approach and existing approaches. Columns: prediction accuracy for each structural class (%) and overall accuracy (%). Rows listed for each dataset: component coupled, neural network, and rough sets.
In this work, we have presented a novel wavelet-variance-based feature vector for prediction of protein structural class. The aim of this research is to provide a new and complementary set of features for this problem. Following a pattern recognition framework, the proposed approach is divided into three tasks: amino acid mapping, feature construction, and classification. The feature vector summarizes the variation of ten different physicochemical properties of amino acids. The feature extraction technique is based on wavelet-based time-series analysis. Experiments were performed on two standard datasets (constructed by Zhou). The results of the LOOCV test show that the proposed method achieves better accuracy than existing methods. The proposed approach can also be applied to related problems such as identification of membrane protein type and enzyme family classification.
- Levitt M, Chothia C: Structural patterns in globular proteins. Nature 1976, 261(5561):552-558. doi:10.1038/261552a0
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of protein database for the investigation of sequence and structures. Journal of Molecular Biology 1992, 225(4):713-727.
- Klein JP, Delisi C: Prediction of protein structural class from the amino acid sequence. Biopolymers 1986, 25(9):1659-1672. doi:10.1002/bip.360250909
- Klein P: Prediction of protein structural class by discriminant analysis. Biochimica et Biophysica Acta 1986, 874(2):205-215. doi:10.1016/0167-4838(86)90119-6
- Zhou G, Xu X, Zhang C-T: A weighting method for predicting protein structural class from amino acid composition. European Journal of Biochemistry 1992, 210(3):747-749. doi:10.1111/j.1432-1033.1992.tb17476.x
- Zhang C-T, Chou K-C: An optimization approach to predicting protein structural class from amino acid composition. Protein Science 1992, 1(3):401-408.
- Metfessel BA, Saurugger PN, Connelly DP, Rich SS: Cross-validation of protein structural class prediction using statistical clustering and neural networks. Protein Science 1993, 2(7):1171-1182. doi:10.1002/pro.5560020712
- Chou K-C: A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Structure, Function and Genetics 1995, 21(4):319-344. doi:10.1002/prot.340210406
- Chou K-C, Zhang C-T: Predicting protein folding types by distance functions that make allowances for amino acid interactions. Journal of Biological Chemistry 1994, 269(35):22014-22020.
- Cai Y-D, Liu X-J, Xu X-B, Zhou G-P: Support vector machines for predicting protein structural class. BMC Bioinformatics 2001, 2, article 3:1-5.
- Shen H-B, Yang J, Liu X-J, Chou K-C: Using supervised fuzzy clustering to predict protein structural classes. Biochemical and Biophysical Research Communications 2005, 334(2):577-581. doi:10.1016/j.bbrc.2005.06.128
- Feng K-Y, Cai Y-D, Chou K-C: Boosting classifier for predicting protein domain structural class. Biochemical and Biophysical Research Communications 2005, 334(1):213-217. doi:10.1016/j.bbrc.2005.06.075
- Cao Y, Liu S, Zhang L, Qin J, Wang J, Tang K: Prediction of protein structural class with rough sets. BMC Bioinformatics 2006, 7, article 20:1-6.
- Vannucci M, Lio P: Non-decimated wavelet analysis of biological sequences: applications to protein structure and genomics. Sankhya B 2001, 63(2):218-233.
- Zhou G-P: An intriguing controversy over protein structural class prediction. Journal of Protein Chemistry 1998, 17(8):729-738. doi:10.1023/A:1020713915365
- Percival DB: On estimation of wavelet variance. Biometrika 1995, 82(3):619-631. doi:10.1093/biomet/82.3.619
- Bhaskaran R, Ponnuswamy PK: Positional flexibilities of amino acid residues in globular proteins. International Journal of Peptide and Protein Research 1988, 32:241-255.
- Cid H, Bunster M, Canales M, Gazitúa F: Hydrophobicity and structural classes in proteins. Protein Engineering 1992, 5(5):373-375. doi:10.1093/protein/5.5.373
- Charton M, Charton BI: The structural dependence of amino acid hydrophobicity parameters. Journal of Theoretical Biology 1982, 99(4):629-644. doi:10.1016/0022-5193(82)90191-6
- Simon Z: Quantum Biochemistry and Specific Interactions. Abacus Press, Tunbridge Wells, Kent, UK; 1976.
- Chothia C: The nature of the accessible and buried surfaces in proteins. Journal of Molecular Biology 1976, 105(1):1-12. doi:10.1016/0022-2836(76)90191-1
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Volume 5. Edited by: Dayhoff MO. National Biomedical Research Foundation, Washington, DC, USA; 1978:345-352.
- Zimmerman JM, Eliezer N, Simha R: The characterization of amino acid sequences in proteins by statistical methods. Journal of Theoretical Biology 1968, 21(2):170-201. doi:10.1016/0022-5193(68)90069-6
- Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185(4154):862-864. doi:10.1126/science.185.4154.862
- Fauchere J-L, Charton M, Kier LB, Verloop A, Pliska V: Amino acid side chain parameters for correlation studies in biology and pharmacology. International Journal of Peptide and Protein Research 1988, 32(4):269-278.
- Fasman GD: Practical Handbook of Biochemistry and Molecular Biology. CRC Press, Boca Raton, Fla, USA; 1989.
- Daubechies I: Ten Lectures on Wavelets. SIAM, Philadelphia, Pa, USA; 1992.
- Mallat SG: Theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 1989, 11(7):674-693. doi:10.1109/34.192463
- Percival DB, Walden AT: Wavelet Methods for Time Series Analysis. Cambridge Press, Cambridge, UK; 2002.
- Whitcher B, Guttorp P, Percival DB: Wavelet analysis of covariance with application to atmospheric time series. Journal of Geophysical Research 2000, 105(D11):941-962.
- Gallegati M, Gallegati M: Wavelet variance and correlation analyses of output in G7 countries. Macroeconomics 2005, 0512017:1-19.
- Xiong X, Zhang X-T, Zhang W, Li C-Y: Wavelet-based beta estimation of China stock market. Proceedings of the 4th International Conference on Machine Learning and Cybernetics (ICMLC '05), vol. 6, Guangzhou, China, August 2005, 3501-3505.
- Percival DB, Mofjeld HO: Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association 1997, 92(439):868-880. doi:10.2307/2965551
- Cortes C, Vapnik V: Support vector networks. Machine Learning 1995, 20(3):273-297.
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. National Taiwan University, Taipei, Taiwan; 2004. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412-424. doi:10.1093/bioinformatics/16.5.412
- Ding CHQ, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001, 17(4):349-358. doi:10.1093/bioinformatics/17.4.349
- Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17(8):721-728. doi:10.1093/bioinformatics/17.8.721
- Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters 2006, 27(8):861-874. doi:10.1016/j.patrec.2005.10.010
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.