- Research Article
- Open access
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification
EURASIP Journal on Bioinformatics and Systems Biology volume 2007, Article number: 87356 (2007)
Abstract
We investigate methods of estimating residue correlation within protein sequences. We begin with the mutual information (MI) of adjacent residues, then extend the approach by defining the mutual information vector (MIV), which estimates long-range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in family classification experiments, the modeling power of the MIV was significantly better than that of the classic MI method, reaching a level at which proteins can be classified without alignment information.
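The idea of an MI value per residue separation can be illustrated with a minimal sketch. The abstract does not give the article's formulas, so everything here is an assumption: a plug-in (frequency-count) estimator, logarithms in base 2, a vector indexed by gaps 1 through `max_gap`, and the hypothetical function names `mutual_information` and `mutual_information_vector`. It is not the authors' implementation.

```python
import math
from collections import Counter


def mutual_information(seq, gap=1):
    """Plug-in MI estimate (in bits) between residues `gap` positions apart.

    Assumption: probabilities are estimated from raw pair counts within
    a single sequence, with no finite-sample correction.
    """
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)               # counts of (first, second) pairs
    left = Counter(a for a, _ in pairs)  # marginal counts, first residue
    right = Counter(b for _, b in pairs) # marginal counts, second residue
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (left[a] * right[b])
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


def mutual_information_vector(seq, max_gap=20):
    """Vector of MI estimates for gaps 1..max_gap (hypothetical layout)."""
    return [mutual_information(seq, g) for g in range(1, max_gap + 1)]


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary toy sequence
    print(mutual_information_vector(example, max_gap=5))
```

With `gap=1` the function corresponds to the adjacent-residue case. A hydropathy-based variant could, under the same assumptions, map each residue to a class label before calling these functions. For short sequences the raw counts are sparse, so the plug-in estimate should be treated with caution.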
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Hemmerich, C., Kim, S. A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification. J Bioinform Sys Biology 2007, 87356 (2007). https://doi.org/10.1155/2007/87356
DOI: https://doi.org/10.1155/2007/87356