  • Research Article
  • Open access

NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks

Abstract

Typical problems in bioinformatics involve large discrete datasets, so applying statistical methods in such domains requires efficient algorithms suited to discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for statistical inference. Its mathematical formalization is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. For discrete data, straightforward computation of the NML distribution takes time exponential in the sample size, because the definition involves a sum over all possible data samples of a fixed size. In this paper, we first review existing algorithms for efficient NML computation for the multinomial and naive Bayes model families. We then extend these algorithms to more complex, tree-structured Bayesian networks.
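To make the cost of the straightforward computation concrete, the following minimal sketch evaluates the NML normalizing sum for a single multinomial variable with K values and sample size n in two ways. It assumes the usual definition, in which the normalizer is the sum of the maximized likelihood over all K^n possible samples. The second function uses the recurrence C(K+2, n) = C(K+1, n) + (n/K) C(K, n), which is recalled here from the linear-time algorithm of reference 15 and is not stated in the abstract itself. The function names are illustrative, not the authors'.

```python
from itertools import product
from math import comb


def nml_normalizer_bruteforce(K, n):
    """Sum the maximized multinomial likelihood over all K**n samples of size n."""
    total = 0.0
    for sample in product(range(K), repeat=n):
        likelihood = 1.0
        for value in range(K):
            h = sample.count(value)
            # Maximum likelihood estimate of each value's probability is h / n.
            # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
            likelihood *= (h / n) ** h
        total += likelihood
    return total


def nml_normalizer_recurrence(K, n):
    """Same quantity via C(k+2, n) = C(k+1, n) + (n / k) * C(k, n) (assumed recurrence)."""
    if K == 1:
        return 1.0
    c_prev = 1.0  # C(1, n): a one-valued variable admits a single sample
    # C(2, n): closed-form sum over the count h of the first value.
    c_curr = sum(
        comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return c_curr


if __name__ == "__main__":
    for K, n in [(2, 2), (3, 4), (4, 5)]:
        print(K, n, nml_normalizer_bruteforce(K, n), nml_normalizer_recurrence(K, n))
```

The brute-force version visits K^n samples and is usable only for tiny n, whereas the recurrence needs O(n + K) arithmetic operations. The two agree up to floating-point rounding on the small cases printed above.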

[1–33]

References

  1. Korodi G, Tabus I: An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Transactions on Information Systems 2005, 23(1):3-34. 10.1145/1055709.1055711

  2. Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown B: Clustering methods for the analysis of DNA microarray data. Department of Health Research and Policy, Stanford University, Stanford, Calif, USA; 1999.

  3. Pan W, Lin J, Le CT: Model-based cluster analysis of microarray gene-expression data. Genome Biology 2002, 3(2):1-8.

  4. McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18(3):413-422. 10.1093/bioinformatics/18.3.413

  5. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA: Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), The Big Island of Hawaii, Hawaii, USA, January 2001 422-433.

  6. Rissanen J: Modeling by shortest data description. Automatica 1978, 14(5):465-471. 10.1016/0005-1098(78)90005-5

  7. Rissanen J: Stochastic complexity. Journal of the Royal Statistical Society, Series B 1987, 49(3):223-239. with discussions, 223–265

  8. Rissanen J: Fisher information and stochastic complexity. IEEE Transactions on Information Theory 1996, 42(1):40-47. 10.1109/18.481776

  9. Shtarkov YuM: Universal sequential coding of single messages. Problems of Information Transmission 1987, 23(3):175-186.

  10. Barron A, Rissanen J, Yu B: The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory 1998, 44(6):2743-2760. 10.1109/18.720554

  11. Rissanen J: Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory 2001, 47(5):1712-1717. 10.1109/18.930912

  12. Grünwald P: The Minimum Description Length Principle. The MIT Press, Cambridge, Mass, USA; 2007.

  13. Rissanen J: Information and Complexity in Statistical Modeling. Springer, New York, NY, USA; 2007.

  14. Heckerman D: A tutorial on learning with Bayesian networks. In Tech. Rep. MSR-TR-95-06. Microsoft Research, Advanced Technology Division, One Microsoft Way, Redmond, Wash, USA, 98052; 1996.

  15. Kontkanen P, Myllymäki P: A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters 2007, 103(6):227-233. 10.1016/j.ipl.2007.04.003

  16. Kontkanen P, Myllymäki P, Buntine W, Rissanen J, Tirri H: An MDL framework for data clustering. In Advances in Minimum Description Length: Theory and Applications. Edited by: Grünwald P, Myung IJ, Pitt M. The MIT Press, Cambridge, Mass, USA; 2006.

  17. Xie Q, Barron AR: Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory 2000, 46(2):431-445. 10.1109/18.825803

  18. Balasubramanian V: MDL, Bayesian inference, and the geometry of the space of probability distributions. In Advances in Minimum Description Length: Theory and Applications. Edited by: Grünwald P, Myung IJ, Pitt M. The MIT Press, Cambridge, Mass, USA; 2006:81-98.

  19. Kontkanen P, Myllymäki P: MDL histogram density estimation. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS '07), San Juan, Puerto Rico, USA, March 2007.

  20. Kontkanen P, Buntine W, Myllymäki P, Rissanen J, Tirri H: Efficient computation of stochastic complexity. In Proceedings of the 9th International Conference on Artificial Intelligence and Statistics, Key West, Fla, USA, January 2003. Edited by: Bishop C, Frey B. Society for Artificial Intelligence and Statistics; 233-238.

  21. Koivisto M: Sum-Product Algorithms for the Analysis of Genetic Risks. In Tech. Rep. A-2004-1. Department of Computer Science, University of Helsinki, Helsinki, Finland; 2004.

  22. Kontkanen P, Myllymäki P: A fast normalized maximum likelihood algorithm for multinomial data. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), Edinburgh, Scotland, August 2005

  23. Knuth DE, Pittel B: A recurrence related to trees. Proceedings of the American Mathematical Society 1989, 105(2):335-349. 10.1090/S0002-9939-1989-0949878-9

  24. Corless RM, Gonnet GH, Hare DEG, Jeffrey DJ, Knuth DE: On the Lambert W function. Advances in Computational Mathematics 1996, 5(1):329-359. 10.1007/BF02124750

  25. Szpankowski W: Average Case Analysis of Algorithms on Sequences. John Wiley & Sons, New York, NY, USA; 2001.

  26. Flajolet P, Odlyzko AM: Singularity analysis of generating functions. SIAM Journal on Discrete Mathematics 1990, 3(2):216-240. 10.1137/0403019

  27. Schwarz G: Estimating the dimension of a model. Annals of Statistics 1978, 6(2):461-464. 10.1214/aos/1176344136

  28. Kontkanen P, Myllymäki P, Tirri H: Constructing Bayesian finite mixture models by the EM algorithm. In Tech. Rep. NC-TR-97-003. ESPRIT Working Group on Neural and Computational Learning (NeuroCOLT), Helsinki, Finland; 1997.

  29. Kontkanen P, Myllymäki P, Silander T, Tirri H: On Bayesian case matching. In Proceedings of the 4th European Workshop Advances in Case-Based Reasoning (EWCBR '98), Lecture Notes In Computer Science, Springer, Dublin, Ireland, September 1998 Edited by: Smyth B, Cunningham P. 1488: 13-24.

  30. Grünwald P, Kontkanen P, Myllymäki P, Silander T, Tirri H: Minimum encoding approaches for predictive modeling. In Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI '98), Madison, Wis, USA, July 1998. Edited by: Cooper G, Moral S. Morgan Kaufmann; 183-192.

  31. Kontkanen P, Myllymäki P, Silander T, Tirri H, Grünwald P: On predictive distributions and Bayesian networks. Statistics and Computing 2000, 10(1):39-54. 10.1023/A:1008984400380

  32. Kontkanen P, Lahtinen J, Myllymäki P, Silander T, Tirri H: Supervised model-based visualization of high-dimensional data. Intelligent Data Analysis 2000, 4(3-4):213-227.

  33. Dyer M, Kannan R, Mount J: Sampling contingency tables. Random Structures and Algorithms 1997, 10(4):487-506. 10.1002/(SICI)1098-2418(199707)10:4<487::AID-RSA4>3.0.CO;2-Q

Author information

Corresponding author

Correspondence to Petri Kontkanen.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Kontkanen, P., Wettig, H. & Myllymäki, P. NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks. J Bioinform Sys Biology 2007, 90947 (2008). https://doi.org/10.1155/2007/90947

  • DOI: https://doi.org/10.1155/2007/90947
