Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

Xiao, Yufei; Hua, Jianping; Dougherty, Edward R

doi:10.1155/2007/16354

Research Article
Open access
Published: 19 February 2007


Yufei Xiao¹, Jianping Hua² & Edward R Dougherty¹,²

EURASIP Journal on Bioinformatics and Systems Biology volume 2007, Article number: 16354 (2007)

Abstract

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, which is most directly analyzed via the deviation distribution of the estimator, that is, the distribution of the difference between the estimated and true errors. Past studies indicate that, given a fixed feature set specified in advance, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify how much feature selection adds to the variation of the deviation distribution, over and above the variation already present without feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance when feature selection is used, as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, accounting for over half of the variance in many of the cases studied. We consider linear discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and to patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection than when cross-validation is performed on a given feature set. This is reflected in the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection to the deviation variance.
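
As a rough illustration of the quantities named in the abstract, the following minimal sketch computes deviations (estimated error minus true error) and a CRIDD-like ratio from two samples of deviations. The normalization used here, (Var_fs - Var_nofs) / Var_fs, is an assumption inferred from the abstract's statement that feature selection contributes "over half of the variance"; the exact definition is given in the body of the paper, not in this abstract. The function names and the numbers are hypothetical and are not results from the study.

```python
from statistics import pvariance


def deviations(estimated_errors, true_errors):
    """Return estimated minus true error, one value per training sample set."""
    return [est - true for est, true in zip(estimated_errors, true_errors)]


def cridd(dev_with_fs, dev_without_fs):
    """CRIDD-like ratio under the ASSUMED form (Var_fs - Var_nofs) / Var_fs.

    dev_with_fs:    deviations when features are selected from the data.
    dev_without_fs: deviations when a fixed feature set is used.
    """
    var_fs = pvariance(dev_with_fs)
    var_nofs = pvariance(dev_without_fs)
    if var_fs == 0:
        raise ValueError("deviation variance with feature selection is zero")
    return (var_fs - var_nofs) / var_fs


# Hypothetical deviations, for illustration only (not data from the paper).
dev_nofs = [-0.02, 0.01, 0.03, -0.01, -0.01]
dev_fs = [-0.04, 0.02, 0.06, -0.02, -0.02]  # twice the spread of dev_nofs

kappa = cridd(dev_fs, dev_nofs)
print(f"CRIDD-like ratio: {kappa:.2f}")
```

Under this assumed form, a positive value means the deviation distribution is more dispersed with feature selection than without it, and a value above 0.5 means feature selection accounts for more than half of the deviation variance.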


References

Devroye L, Gyorfi L, Lugosi G: A Probabilistic Theory of Pattern Recognition. Springer, New York, NY, USA; 1996.
Braga-Neto U, Dougherty ER: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004, 20(3):374-380. 10.1093/bioinformatics/btg419
Braga-Neto U, Dougherty ER: Bolstered error estimation. Pattern Recognition 2004, 37(6):1267-1281. 10.1016/j.patcog.2003.08.017
Sima C, Braga-Neto U, Dougherty ER: Superior feature-set ranking for small samples using bolstered error estimation. Bioinformatics 2005, 21(7):1046-1054. 10.1093/bioinformatics/bti081
Sima C, Attoor S, Braga-Neto U, Lowey J, Suh E, Dougherty ER: Impact of error estimation on feature selection. Pattern Recognition 2005, 38(12):2472-2482. 10.1016/j.patcog.2005.03.026
Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics 2005, 21(15):3301-3307. 10.1093/bioinformatics/bti499
Pudil P, Novovicova J, Kittler J: Floating search methods in feature selection. Pattern Recognition Letters 1994, 15(11):1119-1125. 10.1016/0167-8655(94)90127-9
Xiao Y, Hua J, Dougherty ER: Feature selection increases cross-validation imprecision. Proceedings of the 4th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06), College Station, Tex, USA, May 2006.
van't Veer LJ, Dai H, van de Vijver MJ, et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536. 10.1038/415530a
van de Vijver MJ, He YD, van't Veer LJ, et al.: A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine 2002, 347(25):1999-2009. 10.1056/NEJMoa021967
Choudhary A, Brun M, Hua J, Lowey J, Suh E, Dougherty ER: Genetic test bed for feature selection. Bioinformatics 2006, 22(7):837-842. 10.1093/bioinformatics/btl008
Jain A, Zongker D: Feature selection: evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997, 19(2):153-158. 10.1109/34.574797
Kudo M, Sklansky J: Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 2000, 33(1):25-41. 10.1016/S0031-3203(99)00041-2

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
Yufei Xiao & Edward R Dougherty
Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ, 85004, USA
Jianping Hua & Edward R Dougherty

Corresponding author

Correspondence to Yufei Xiao.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Xiao, Y., Hua, J. & Dougherty, E.R. Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation. J Bioinform Sys Biology 2007, 16354 (2007). https://doi.org/10.1155/2007/16354

Received: 07 August 2006
Revised: 21 December 2006
Accepted: 26 December 2006
Published: 19 February 2007
DOI: https://doi.org/10.1155/2007/16354
