Unbiased bootstrap error estimation for linear discriminant analysis
- Thang Vu^{1},
- Chao Sima^{2},
- Ulisses M Braga-Neto^{1, 2} and
- Edward R Dougherty^{1, 2}
DOI: 10.1186/s13637-014-0015-0
© Vu et al.; licensee Springer. 2014
Received: 17 February 2014
Accepted: 18 August 2014
Published: 3 October 2014
Abstract
Convex bootstrap error estimation is a popular tool for classifier error estimation in gene expression studies. A basic question is how to determine the weight for the convex combination between the basic bootstrap estimator and the resubstitution estimator such that the resulting estimator is unbiased at finite sample sizes. The well-known 0.632 bootstrap error estimator uses asymptotic arguments to propose a fixed 0.632 weight, whereas the more recent 0.632+ bootstrap error estimator attempts to set the weight adaptively. In this paper, we study the finite sample problem in the case of linear discriminant analysis under Gaussian populations. We derive exact expressions for the weight that guarantee unbiasedness of the convex bootstrap error estimator in the univariate and multivariate cases, without making asymptotic simplifications. Using exact computation in the univariate case and an accurate approximation in the multivariate case, we obtain the required weight and show that it can deviate significantly from the constant 0.632 weight, depending on the sample size and the Bayes error for the problem. The methodology is illustrated by application to data from a well-known cancer classification study.
Keywords
Bootstrap, Error estimation, Bias, Linear discriminant analysis, Gene expression classification
1 Introduction
The bootstrap method [1]–[7] has been used in a wide range of statistical problems. The asymptotic behavior of bootstrap has been studied [8]–[11], while small-sample properties have been studied under simplifying assumptions, such as considering the estimator based on all possible bootstrap samples (the ‘complete’ bootstrap) [12]–[14]. The small-sample properties of the usual bootstrap are not well understood, in particular when it comes to estimating the error rates of classification rules [15],[16].
There has been, on the other hand, interest in the application of bootstrap to error estimation in classification problems and, in particular, gene expression classification studies [17]–[20]. Of particular interest is the issue of classifier error estimation [21],[22]. Bootstrap methods have generally been shown to outperform more traditional error estimation techniques, such as resubstitution and cross-validation, in terms of root-mean-square (RMS) error [4],[5],[7],[23]–[35]. Bootstrap error estimation is typically performed via a convex combination of the (generally) pessimistic basic bootstrap estimator, known as the zero bootstrap, and the (generally) optimistic resubstitution estimator. A basic problem is how to choose the weight that yields an unbiased estimator.
The problem of unbiased convex error estimation was previously considered in [36]–[38] for a convex combination of resubstitution and cross-validation estimators, and in [4],[7],[23] for a combination of resubstitution and the basic bootstrap estimator. In the former case, a fixed suboptimal weight of 0.5 was proposed in [36],[38], while an asymptotic analysis to find the optimal weight was provided in [37]. In the latter case, which is our case of interest, a fixed suboptimal weight of 0.632 was proposed in [4], leading to the well-known 0.632 bootstrap estimator. In [7], a suboptimal weight is computed by a sample-based procedure that attempts to counterbalance the effect of overfitting on the bias, leading to the so-called 0.632+ bootstrap error estimator. The problem of finding the optimal weight for finite sample cases was addressed via a numerical approach in [23].
Here, we determine the optimal weight for finite sample cases analytically, in the case of linear discriminant analysis under Gaussian populations. In the univariate case, no other assumptions are made. In the multivariate case, it is assumed that the populations are homoskedastic and that the common covariance matrix is known and used in the discriminant. In either case, no simplifications are introduced to the bootstrap error estimator; it is the usual one, based on a finite number of random bootstrap samples.
The analysis in this paper follows in the footsteps of previous papers that have provided analytical representations for the moments of error-estimator distributions [39],[40]. In the univariate case, exact expressions are given for the expectation of the zero bootstrap error estimator, in the general heteroskedastic (general-variance) Gaussian case. Together with similar expressions for the expected true and resubstitution error [39], this allows the exact calculation of the required weight. In the multivariate case, the expectation of the zero bootstrap error estimator is expressed as a probability involving the ratio of two noncentral chi-square variables, in the homoskedastic Gaussian case, assuming that the true common covariance matrix is used in the discriminant. The resulting expression is exact but necessitates approximation for its numerical computation. This is done in this paper via the Imhof-Pearson three-moment method, which is accurate in small-sample cases [41]. Use of similar expressions for the expected true and resubstitution error [40] then allows the calculation of the required weight, up to the accuracy of this approximation.
In the homoskedastic case, the required weight for unbiasedness is shown to be a function only of the Bayes error and sample size. Accordingly, plots and tables of the required weight for varying values of Bayes error and sample size are presented; if the Bayes error can be estimated for a problem, this provides a way to obtain the optimal weight to use. In the univariate case, it was observed that as the sample size increases, the optimal weight settles on an asymptotic value of around 0.675, thus slightly over the heuristic value 0.632; by contrast, in the multivariate case (d=2), the asymptotic value appears to be strongly dependent on the Bayes error, being as a rule significantly smaller than 0.632, except for very small Bayes error.
This paper is organized as follows. The ‘Bootstrap classification’ section defines linear discriminant analysis as well as its application under bootstrap sampling. The ‘Bootstrap error estimation’ section reviews convex bootstrap error estimation. The ‘Unbiased bootstrap error estimation’ section contains the main theoretical results in the paper, providing the analytical expressions for the computation of the required convex bootstrap weight in the univariate and multivariate cases. The ‘Gene expression classification example’ section contains a demonstration of the usage of the optimal weight in bootstrap error estimation using data from the breast cancer classification study in [42],[43]. Lastly, the ‘Conclusions’ section contains a summary and concluding remarks.
All the proofs are presented in the Appendix.
2 Bootstrap classification
Classification involves a predictor vector X∈R^{ d }, also known as a feature vector, which represents an individual from one of two populations Π_{0} and Π_{1} (we consider here only this binary classification problem). The classification problem is to assign X correctly to its population of origin. The populations are coded into a discrete label Y∈{0,1}. Therefore, given a feature vector X, classification attempts to predict the corresponding value of the label Y. We assume that there is a joint feature-label distribution F_{ XY } for the pair (X,Y) characterizing the classification problem. In particular, it determines the probabilities c_{0}=P(X∈Π_{0})=P(Y=0) and c_{1}=P(X∈Π_{1})=P(Y=1), which are called the prior probabilities.
Given a fixed sample size n, the sample data is an i.i.d. sample S_{n}={(X_{1},Y_{1}),…,(X_{n},Y_{n})} from F_{XY}. The population-specific sample sizes are given by $n_0 = \sum_{i=1}^{n} I_{Y_i=0}$ and $n_1 = \sum_{i=1}^{n} I_{Y_i=1} = n - n_0$, which are random variables, with n_{0}∼Binomial(n,c_{0}) and n_{1}∼Binomial(n,c_{1}). When we need to emphasize that n_{0} and n_{1} are random variables, we will use capital letters N_{0} and N_{1}, respectively. This sampling design, which is the most commonly found one in contemporary pattern recognition, is known as mixture sampling [44].
A classification rule Ψ_{n} maps the sample S_{n} to a classifier ψ_{n}=Ψ_{n}(S_{n}), whose classification error is
$\epsilon_n = P(\psi_n(X) \ne Y) = c_0\,\epsilon_n^0 + c_1\,\epsilon_n^1$, (1)
where (X,Y) is an independent test point and $\epsilon_n^i = P(\psi_n(X) = 1-i \mid Y=i)$ is the error rate specific to population Π_{i}, for i=0,1. Since the training set S_{n} is random, ε_{n} is a random variable, with expected classification error rate E[ε_{n}]; this gives the average performance over all possible training sets S_{n}, for fixed sample size n.
are the sample means relative to each population, and Σ is a matrix, which can be either (1) the true common covariance matrix of the populations, assuming it is known (this is the approach followed, for example, in [39],[40],[46]), or (2) the sample covariance matrix based on the pooled sample S_{ n }, which leads to the general LDA case. In this paper, we will assume case (1) throughout.
that is, the sign of W(X) determines the classification of X.
A bootstrap sample $S_n^{\ast}$ contains n instances drawn uniformly, with replacement, from S_{n}. Hence, some of the instances in S_{n} may appear multiple times in $S_n^{\ast}$, whereas others may not appear at all. Let C be a vector of size n, where the ith component C(i) equals the number of appearances in $S_n^{\ast}$ of the ith instance in S_{n}. The vector C will be referred to as a bootstrap vector.
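As a minimal illustration of the bootstrap vector just defined, the following sketch draws one vector C using only the Python standard library. The function name `draw_bootstrap_vector` and the seed are illustrative assumptions, not part of the paper. The last two lines compare the observed fraction of zero entries with (1−1/n)^n, the probability that a given instance is left out of a bootstrap sample, which tends to e^{−1}≈0.368 as n grows.

```python
import random
from collections import Counter

def draw_bootstrap_vector(n, rng):
    """Draw n indices uniformly with replacement and count appearances.

    Returns a list C of length n where C[i] is the number of times
    instance i was drawn, so that sum(C) == n.
    """
    picks = [rng.randrange(n) for _ in range(n)]
    counts = Counter(picks)
    return [counts.get(i, 0) for i in range(n)]

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
n = 20
C = draw_bootstrap_vector(n, rng)
print("C =", C)
print("observed fraction of zeros:", C.count(0) / n)
print("(1 - 1/n)^n               :", (1 - 1 / n) ** n)
```

Entries with C(i)=0 mark the instances that do not appear in the bootstrap sample; these are the points available for testing a classifier designed on that sample.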
For a given S_{n}, the vector C uniquely determines a bootstrap sample $S_n^{\ast}$, which we denote by $S_n^C$. Note that the original sample itself is included: if $C=(1,\dots,1)\stackrel{\text{def}}{=}\mathbf{1}_n$, then $S_n^C = S_n$, since each original instance appears once in the bootstrap sample. Note also that the number of distinct bootstrap samples, i.e., values for C, is equal to $\binom{2n-1}{n}$; even for small n, this is a large number. For example, the total number of possible bootstrap samples of size n=20 is larger than 6.8×10^{10}.
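The count quoted above can be checked directly. The short sketch below evaluates the binomial coefficient with `math.comb`; the helper name `num_bootstrap_vectors` is an assumption introduced here for illustration.

```python
from math import comb

def num_bootstrap_vectors(n):
    """Number of vectors of n nonnegative integers that sum to n."""
    return comb(2 * n - 1, n)

for n in (5, 10, 20):
    print(n, num_bootstrap_vectors(n))
# n = 20 gives 68923264410, i.e. about 6.89e10
```

The rapid growth is why exhaustive enumeration of bootstrap vectors is only practical for very small n.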
Starting from a classification rule Ψ_{n}, one may design a classifier $\psi_n^C = \Psi_n(S_n^C)$ on a bootstrap training set $S_n^C$. Its classification error $\epsilon_n^C$ is given as in (1), namely, $\epsilon_n^C = c_0\,\epsilon_n^{C,0} + c_1\,\epsilon_n^{C,1}$, where $\epsilon_n^{C,i} = P(\psi_n^C(X) = 1-i \mid Y=i)$ is the error rate specific to population Π_{i}, for i=0,1. In this paper, we apply this scheme to the LDA classification rule defined previously. Notice the distinction between a bootstrap LDA classifier and a ‘bagged’ (bootstrap-aggregated) LDA classifier [47],[48]; these correspond to distinct classification rules. The bootstrap LDA classifier is employed here as an auxiliary tool to analyze the problem of unbiased bootstrap error estimation for the plain LDA classifier.
3 Bootstrap error estimation
This resubstitution estimator, or apparent error, is often optimistically biased, that is, it is often the case that $\text{Bias}(\hat{\epsilon}_n^{\,r}) = E[\hat{\epsilon}_n^{\,r}] - E[\epsilon_n] < 0$, though this is not always so. The bias tends to worsen with more complex classification rules [49].
where n(C) is the number of zeros in C.
Selecting the appropriate weight w=w^{∗} leads to an unbiased error estimator, $E[\hat{\epsilon}_n^{\,\text{conv}}] = E[\epsilon_n]$.
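Fixing the weight at w=0.632 gives the 0.632 bootstrap error estimator [4], $0.368\,\hat{\epsilon}_n^{\,r} + 0.632\,\hat{\epsilon}_n^{\,\text{boot}}$, which has been heavily employed in the machine learning field.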
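The estimators reviewed in this section can be sketched for a single univariate sample as follows. This is an illustration under several assumptions that are not taken from the paper: the sign convention of the discriminant, the form of the zero bootstrap estimator (misclassified left-out points divided by the total number of left-out points over all resamples), the decision to skip resamples in which one class is absent, and the use of a sample with fixed class sizes for simplicity. All function names are hypothetical.

```python
import random

def fit_means(sample):
    """Class sample means of a list of (x, y) pairs, or None if a class is empty."""
    x0 = [x for x, y in sample if y == 0]
    x1 = [x for x, y in sample if y == 1]
    if not x0 or not x1:
        return None
    return sum(x0) / len(x0), sum(x1) / len(x1)

def predict(means, x):
    """Univariate LDA: assign class 0 when the discriminant is positive."""
    m0, m1 = means
    # The positive factor 1/sigma^2 does not affect the sign and is omitted.
    w = (x - 0.5 * (m0 + m1)) * (m0 - m1)
    return 0 if w > 0 else 1

def error_count(means, points):
    return sum(predict(means, x) != y for x, y in points)

def resubstitution(sample):
    """Error rate of the classifier on its own training data."""
    return error_count(fit_means(sample), sample) / len(sample)

def zero_bootstrap(sample, num_boot, rng):
    """Train on each resample, test on the points left out of it."""
    n = len(sample)
    errors = tested = 0
    for _ in range(num_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        means = fit_means([sample[i] for i in idx])
        if means is None:
            continue  # assumption: skip resamples that miss a class
        chosen = set(idx)
        left_out = [sample[i] for i in range(n) if i not in chosen]
        errors += error_count(means, left_out)
        tested += len(left_out)
    return errors / tested

def convex_bootstrap(sample, w, num_boot, rng):
    """Convex combination (1 - w) * resubstitution + w * zero bootstrap."""
    return (1 - w) * resubstitution(sample) + w * zero_bootstrap(sample, num_boot, rng)

data_rng = random.Random(1)
sample = ([(data_rng.gauss(0.0, 1.0), 0) for _ in range(10)]
          + [(data_rng.gauss(1.5, 1.0), 1) for _ in range(10)])
resub = resubstitution(sample)
boot = zero_bootstrap(sample, 200, random.Random(2))
conv = convex_bootstrap(sample, 0.632, 200, random.Random(2))
print("resubstitution:", resub)
print("zero bootstrap:", boot)
print("0.632 convex  :", conv)
```

Reusing the same seed for the last two calls makes the convex estimate an exact combination of the two printed component estimates, which is convenient for checking the arithmetic.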
4 Unbiased bootstrap error estimation
The 0.632 bootstrap error estimator reviewed in the previous section is not guaranteed to be unbiased. In this section, we will examine the necessary conditions for setting the weight w=w^{∗} in (8) to achieve unbiasedness. We will then particularize the analysis to the Gaussian linear discriminant case, where exact expressions for w^{∗} will be derived, both in the univariate and multivariate cases.
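With the convex estimator written as $\hat{\epsilon}_n^{\,\text{conv}} = (1-w)\,\hat{\epsilon}_n^{\,r} + w\,\hat{\epsilon}_n^{\,\text{boot}}$, setting $E[\hat{\epsilon}_n^{\,\text{conv}}] = E[\epsilon_n]$ and solving for w gives the weight
$w^{\ast} = \dfrac{E[\epsilon_n] - E[\hat{\epsilon}_n^{\,r}]}{E[\hat{\epsilon}_n^{\,\text{boot}}] - E[\hat{\epsilon}_n^{\,r}]}$ (11)
that produces an unbiased error estimator.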
where p(C) is given by (5) and the sum is taken over all possible values of C (an efficient procedure for listing all multinomial vectors is provided by the NEXCOM routine given in [50], Chapter 5). Equations (11) and (12) allow the computation of the weight w^{∗} given the knowledge of E[ε_{n}], $E[\hat{\epsilon}_n^{\,r}]$, and $E[\epsilon_n^{C} \mid C]$. We will present next exact formulas for these expectations in the case of the LDA classification rule under Gaussian populations.
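For very small n, the sum over all bootstrap vectors can be carried out by brute force. The sketch below uses plain recursion rather than the NEXCOM routine of [50], and it assumes that p(C) is the multinomial probability of C under uniform resampling with replacement, n!/(C(1)!⋯C(n)!)·n^{−n}; the helper names are hypothetical.

```python
from math import comb, factorial

def compositions(total, parts):
    """Yield every tuple of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def multinomial_prob(c):
    """Probability of bootstrap vector c (assumed uniform multinomial)."""
    n = sum(c)
    ways = factorial(n)
    for k in c:
        ways //= factorial(k)  # exact: every partial quotient is an integer
    return ways / n ** n

def expectation_over_vectors(g, n):
    """Sum of p(C) * g(C) over all bootstrap vectors C of size n."""
    return sum(multinomial_prob(c) * g(c) for c in compositions(n, n))

n = 4
vectors = list(compositions(n, n))
print("number of vectors:", len(vectors), "expected:", comb(2 * n - 1, n))
print("total probability:", expectation_over_vectors(lambda c: 1.0, n))
print("expected number of zeros:", expectation_over_vectors(lambda c: c.count(0), n))
```

Replacing the lambda with a function that returns a conditional expected error for each C would give a weighted sum of the kind described above, although the enumeration quickly becomes infeasible as n grows.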
4.1 Univariate case
The following functions will be useful. Let Φ(u)=P(Z≤u) and Φ(u,v;ρ)=P((Z_{1},Z_{2})≤(u,v)), where Z is a zero-mean, unit-variance Gaussian random variable, and Z_{1}, Z_{2} are zero-mean, unit-variance random variables that are jointly Gaussian distributed, with correlation coefficient ρ.
Assume that population Π_{i} is distributed as N(μ_{i},σ_{i}^{2}), for i=0,1, where σ_{0}≠σ_{1} in general.
are bootstrap sample means.
Now, note that with N_{0}=n_{0} fixed, the training data labels Y_{i}, i=1,…,n, are no longer random. Since all classification rules of interest are invariant to reordering of the training data, we can, without loss of generality, reorder the sample points so that Y_{i}=0 for i=1,…,n_{0}, and Y_{i}=1 for i=n_{0}+1,…,n. Let the same reordering be applied to a given bootstrap vector C. The next theorem extends John’s result to the classification error of the bootstrapped LDA classification rule defined by (23).
Theorem 1.
The corresponding result for $E[\epsilon_n^{C,1} \mid N_0 = n_0, C]$ is obtained by interchanging all indices 0 and 1.
Proof. See the Appendix.
The expected bootstrap error rate $E[\hat{\epsilon}_n^{\,\text{boot}}]$ can now be computed via (12).
The weight w^{∗} for unbiased bootstrap error estimation can now be computed exactly by means of Equations (11), (12), (14) to (17), (20) to (22), and (25) to (28).
In the special case σ_{0}=σ_{1}=σ (homoskedasticity), it follows easily from the previous expressions that E[ε_{n}], $E[\hat{\epsilon}_n^{\,r}]$, and $E[\hat{\epsilon}_n^{\,\text{boot}}]$ depend only on the sample size n and on the Mahalanobis distance between the populations δ=|μ_{1}−μ_{0}|/σ, and therefore so does the weight w^{∗}, through (11). Since the optimal (Bayes) classification error in this case is ε^{∗}=Φ(−δ/2), there is a one-to-one correspondence between the Bayes error and the Mahalanobis distance. Therefore, in the homoskedastic case, the weight w^{∗} is a function only of the Bayes error ε^{∗} and the sample size n.
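The correspondence ε^{∗}=Φ(−δ/2) stated above is easy to evaluate in both directions. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names are assumptions introduced for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance

def bayes_error(delta):
    """Bayes error for Mahalanobis distance delta: Phi(-delta / 2)."""
    return _STD_NORMAL.cdf(-delta / 2)

def mahalanobis_from_bayes_error(eps):
    """Inverse map: delta = -2 * Phi^{-1}(eps), for 0 < eps <= 0.5."""
    return -2 * _STD_NORMAL.inv_cdf(eps)

for eps in (0.025, 0.100, 0.250, 0.450):
    delta = mahalanobis_from_bayes_error(eps)
    print(f"eps* = {eps:.3f}  ->  delta = {delta:.4f}  ->  eps* = {bayes_error(delta):.3f}")
```

A conversion of this kind lets one move between an estimated Bayes error and the distance δ that parameterizes the expectations in this section.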
Univariate case: required weight w^{∗} for unbiased convex bootstrap estimation
Bayes error | n=10 | n=20 | n=30 | n=40 | n=50 | n=60 | n=70 | n=80 | n=90 | n=100 |
---|---|---|---|---|---|---|---|---|---|---|
ε^{∗}=0.025 | 0.724 | 0.687 | 0.679 | 0.675 | 0.674 | 0.672 | 0.671 | 0.671 | 0.670 | 0.670 |
ε^{∗}=0.050 | 0.736 | 0.696 | 0.685 | 0.680 | 0.678 | 0.676 | 0.674 | 0.673 | 0.672 | 0.672 |
ε^{∗}=0.075 | 0.738 | 0.701 | 0.689 | 0.683 | 0.679 | 0.677 | 0.676 | 0.674 | 0.674 | 0.673 |
ε^{∗}=0.100 | 0.729 | 0.704 | 0.691 | 0.684 | 0.681 | 0.678 | 0.677 | 0.675 | 0.674 | 0.673 |
ε^{∗}=0.125 | 0.708 | 0.701 | 0.692 | 0.686 | 0.682 | 0.679 | 0.677 | 0.676 | 0.675 | 0.674 |
ε^{∗}=0.150 | 0.681 | 0.692 | 0.693 | 0.687 | 0.683 | 0.680 | 0.678 | 0.677 | 0.676 | 0.675 |
ε^{∗}=0.175 | 0.646 | 0.670 | 0.688 | 0.687 | 0.683 | 0.680 | 0.678 | 0.677 | 0.676 | 0.675 |
ε^{∗}=0.200 | 0.625 | 0.631 | 0.673 | 0.683 | 0.683 | 0.681 | 0.679 | 0.677 | 0.676 | 0.675 |
ε^{∗}=0.225 | 0.614 | 0.574 | 0.639 | 0.671 | 0.679 | 0.680 | 0.679 | 0.677 | 0.676 | 0.675 |
ε^{∗}=0.250 | 0.617 | 0.516 | 0.579 | 0.635 | 0.663 | 0.673 | 0.676 | 0.677 | 0.676 | 0.675 |
ε^{∗}=0.275 | 0.641 | 0.470 | 0.498 | 0.563 | 0.617 | 0.648 | 0.664 | 0.671 | 0.673 | 0.674 |
ε^{∗}=0.300 | 0.676 | 0.459 | 0.425 | 0.464 | 0.523 | 0.577 | 0.616 | 0.641 | 0.656 | 0.665 |
ε^{∗}=0.325 | 0.724 | 0.487 | 0.393 | 0.379 | 0.405 | 0.451 | 0.502 | 0.548 | 0.587 | 0.614 |
ε^{∗}=0.350 | 0.780 | 0.549 | 0.422 | 0.356 | 0.331 | 0.334 | 0.356 | 0.389 | 0.428 | 0.469 |
ε^{∗}=0.375 | 0.837 | 0.639 | 0.505 | 0.412 | 0.350 | 0.310 | 0.288 | 0.280 | 0.282 | 0.295 |
ε^{∗}=0.400 | 0.890 | 0.741 | 0.626 | 0.533 | 0.458 | 0.398 | 0.350 | 0.312 | 0.283 | 0.261 |
ε^{∗}=0.425 | 0.935 | 0.842 | 0.761 | 0.690 | 0.627 | 0.570 | 0.519 | 0.474 | 0.434 | 0.399 |
ε^{∗}=0.450 | 0.971 | 0.925 | 0.884 | 0.845 | 0.808 | 0.772 | 0.739 | 0.707 | 0.676 | 0.647 |
Univariate case (continued): n=110 to n=200
Bayes error | n=110 | n=120 | n=130 | n=140 | n=150 | n=160 | n=170 | n=180 | n=190 | n=200 |
---|---|---|---|---|---|---|---|---|---|---|
ε^{∗}=0.025 | 0.669 | 0.669 | 0.669 | 0.669 | 0.669 | 0.669 | 0.669 | 0.668 | 0.668 | 0.668 |
ε^{∗}=0.050 | 0.671 | 0.671 | 0.671 | 0.671 | 0.670 | 0.670 | 0.670 | 0.669 | 0.670 | 0.669 |
ε^{∗}=0.075 | 0.672 | 0.672 | 0.671 | 0.671 | 0.671 | 0.671 | 0.670 | 0.670 | 0.670 | 0.670 |
ε^{∗}=0.100 | 0.673 | 0.672 | 0.672 | 0.671 | 0.671 | 0.671 | 0.671 | 0.670 | 0.670 | 0.670 |
ε^{∗}=0.125 | 0.673 | 0.673 | 0.672 | 0.672 | 0.672 | 0.671 | 0.671 | 0.671 | 0.670 | 0.670 |
ε^{∗}=0.150 | 0.674 | 0.673 | 0.673 | 0.672 | 0.672 | 0.672 | 0.671 | 0.671 | 0.671 | 0.671 |
ε^{∗}=0.175 | 0.674 | 0.673 | 0.673 | 0.672 | 0.672 | 0.672 | 0.672 | 0.671 | 0.671 | 0.671 |
ε^{∗}=0.200 | 0.674 | 0.673 | 0.673 | 0.673 | 0.672 | 0.672 | 0.672 | 0.671 | 0.671 | 0.671 |
ε^{∗}=0.225 | 0.675 | 0.674 | 0.673 | 0.672 | 0.672 | 0.672 | 0.672 | 0.672 | 0.671 | 0.671 |
ε^{∗}=0.250 | 0.675 | 0.674 | 0.673 | 0.673 | 0.672 | 0.672 | 0.672 | 0.672 | 0.671 | 0.671 |
ε^{∗}=0.275 | 0.674 | 0.674 | 0.673 | 0.673 | 0.673 | 0.673 | 0.672 | 0.671 | 0.671 | 0.671 |
ε^{∗}=0.300 | 0.669 | 0.671 | 0.672 | 0.672 | 0.672 | 0.672 | 0.672 | 0.672 | 0.672 | 0.672 |
ε^{∗}=0.325 | 0.635 | 0.648 | 0.657 | 0.663 | 0.666 | 0.668 | 0.669 | 0.670 | 0.671 | 0.671 |
ε^{∗}=0.350 | 0.508 | 0.543 | 0.572 | 0.597 | 0.615 | 0.630 | 0.642 | 0.649 | 0.655 | 0.660 |
ε^{∗}=0.375 | 0.313 | 0.337 | 0.365 | 0.394 | 0.425 | 0.455 | 0.484 | 0.511 | 0.536 | 0.557 |
ε^{∗}=0.400 | 0.245 | 0.234 | 0.229 | 0.228 | 0.229 | 0.235 | 0.243 | 0.254 | 0.268 | 0.283 |
ε^{∗}=0.425 | 0.367 | 0.338 | 0.313 | 0.290 | 0.270 | 0.253 | 0.238 | 0.224 | 0.213 | 0.203 |
ε^{∗}=0.450 | 0.620 | 0.594 | 0.569 | 0.545 | 0.522 | 0.501 | 0.480 | 0.461 | 0.442 | 0.424 |
4.2 Multivariate case
where δ^{2}=(μ_{1}−μ_{0})^{T}Σ^{−1}(μ_{1}−μ_{0}) is the squared Mahalanobis distance between the populations. The corresponding result for $E[\epsilon_n^{1} \mid N_0 = n_0]$ is obtained by interchanging n_{0} and n_{1}. The expected true error rate can then be found by using (16).
The corresponding result for $E[\hat{\epsilon}_n^{\,r,1}]$ is obtained by interchanging n_{0} and n_{1}. The expected resubstitution error rate can then be found by using (22).
where $\hat{\mu}_0^C$ and $\hat{\mu}_1^C$ are defined in (24). The next theorem generalizes John’s result for the multivariate classification error to the case of the bootstrapped LDA classification rule.
Theorem 2.
where s_{0} and s_{1} are defined in (27). The corresponding result for $E[\epsilon_n^{C,1} \mid N_0 = n_0, C]$ is obtained by interchanging s_{0} and s_{1}.
Proof. See the Appendix.
It is easy to check that the result in Theorem 2 reduces to the one in (29) and (30) when C=1_{ n }.
As in the univariate case, Theorem 2 can be used in conjunction with Equations (12) and (28) to compute $E[\hat{\epsilon}_n^{\,\text{boot}}]$.
The weight w^{∗} for unbiased bootstrap error estimation can now be computed exactly by means of Equations (11), (12), (16) to (17), (22), (28), (29) to (32), and (34) to (35).
The same approximation method applies to (31) and (34) by substituting the appropriate values.
As in the univariate case, the assumption of a common covariance matrix Σ makes the expectations E[ε_{n}], $E[\hat{\epsilon}_n^{\,r}]$, and $E[\hat{\epsilon}_n^{\,\text{boot}}]$, and thus also the weight w^{∗}, functions only of n and δ. Since ε^{∗}=Φ(−δ/2), this means that the weight w^{∗} is a function only of the Bayes error ε^{∗} and the sample size n.
Bivariate case: required weight w^{∗} for unbiased convex bootstrap estimation
Bayes error | n=10 | n=20 | n=30 | n=40 | n=50 | n=60 | n=70 | n=80 | n=90 | n=100 |
---|---|---|---|---|---|---|---|---|---|---|
ε^{∗}=0.025 | 0.664 | 0.667 | 0.679 | 0.685 | 0.690 | 0.693 | 0.695 | 0.697 | 0.698 | 0.699 |
ε^{∗}=0.050 | 0.666 | 0.637 | 0.638 | 0.639 | 0.641 | 0.642 | 0.642 | 0.643 | 0.644 | 0.644 |
ε^{∗}=0.075 | 0.670 | 0.617 | 0.610 | 0.608 | 0.606 | 0.606 | 0.605 | 0.605 | 0.605 | 0.605 |
ε^{∗}=0.100 | 0.675 | 0.604 | 0.590 | 0.584 | 0.581 | 0.578 | 0.577 | 0.576 | 0.575 | 0.574 |
ε^{∗}=0.125 | 0.682 | 0.594 | 0.573 | 0.564 | 0.559 | 0.555 | 0.553 | 0.551 | 0.550 | 0.548 |
ε^{∗}=0.150 | 0.691 | 0.588 | 0.560 | 0.547 | 0.539 | 0.534 | 0.530 | 0.528 | 0.526 | 0.524 |
ε^{∗}=0.175 | 0.699 | 0.586 | 0.554 | 0.539 | 0.530 | 0.524 | 0.520 | 0.517 | 0.515 | 0.513 |
ε^{∗}=0.200 | 0.718 | 0.586 | 0.544 | 0.524 | 0.512 | 0.504 | 0.498 | 0.493 | 0.490 | 0.487 |
ε^{∗}=0.225 | 0.738 | 0.592 | 0.542 | 0.517 | 0.502 | 0.492 | 0.485 | 0.479 | 0.475 | 0.471 |
ε^{∗}=0.250 | 0.759 | 0.603 | 0.545 | 0.515 | 0.497 | 0.485 | 0.476 | 0.469 | 0.464 | 0.460 |
ε^{∗}=0.275 | 0.784 | 0.620 | 0.553 | 0.518 | 0.497 | 0.482 | 0.471 | 0.463 | 0.457 | 0.452 |
ε^{∗}=0.300 | 0.815 | 0.647 | 0.572 | 0.530 | 0.503 | 0.485 | 0.472 | 0.462 | 0.454 | 0.448 |
ε^{∗}=0.325 | 0.847 | 0.681 | 0.598 | 0.550 | 0.518 | 0.496 | 0.480 | 0.468 | 0.458 | 0.450 |
ε^{∗}=0.350 | 0.882 | 0.728 | 0.639 | 0.584 | 0.546 | 0.520 | 0.500 | 0.484 | 0.472 | 0.462 |
ε^{∗}=0.375 | 0.915 | 0.784 | 0.695 | 0.635 | 0.592 | 0.560 | 0.535 | 0.516 | 0.500 | 0.487 |
ε^{∗}=0.400 | 0.943 | 0.842 | 0.763 | 0.702 | 0.655 | 0.619 | 0.590 | 0.566 | 0.546 | 0.530 |
ε^{∗}=0.425 | 0.971 | 0.914 | 0.859 | 0.811 | 0.769 | 0.732 | 0.701 | 0.673 | 0.650 | 0.629 |
ε^{∗}=0.450 | 0.987 | 0.960 | 0.933 | 0.905 | 0.879 | 0.853 | 0.830 | 0.807 | 0.786 | 0.766 |
Bivariate case (continued): n=110 to n=200
Bayes error | n=110 | n=120 | n=130 | n=140 | n=150 | n=160 | n=170 | n=180 | n=190 | n=200 |
---|---|---|---|---|---|---|---|---|---|---|
ε^{∗}=0.025 | 0.700 | 0.701 | 0.701 | 0.702 | 0.702 | 0.703 | 0.703 | 0.704 | 0.704 | 0.704 |
ε^{∗}=0.050 | 0.644 | 0.645 | 0.645 | 0.645 | 0.645 | 0.645 | 0.645 | 0.646 | 0.646 | 0.646 |
ε^{∗}=0.075 | 0.604 | 0.604 | 0.604 | 0.604 | 0.604 | 0.604 | 0.604 | 0.604 | 0.604 | 0.604 |
ε^{∗}=0.100 | 0.574 | 0.573 | 0.573 | 0.573 | 0.573 | 0.572 | 0.572 | 0.572 | 0.572 | 0.572 |
ε^{∗}=0.125 | 0.548 | 0.547 | 0.546 | 0.546 | 0.545 | 0.545 | 0.544 | 0.544 | 0.544 | 0.543 |
ε^{∗}=0.150 | 0.523 | 0.522 | 0.521 | 0.520 | 0.519 | 0.518 | 0.518 | 0.517 | 0.517 | 0.517 |
ε^{∗}=0.175 | 0.511 | 0.510 | 0.509 | 0.508 | 0.507 | 0.506 | 0.506 | 0.505 | 0.505 | 0.504 |
ε^{∗}=0.200 | 0.485 | 0.483 | 0.482 | 0.480 | 0.479 | 0.478 | 0.477 | 0.477 | 0.476 | 0.475 |
ε^{∗}=0.225 | 0.469 | 0.466 | 0.464 | 0.463 | 0.461 | 0.460 | 0.459 | 0.458 | 0.457 | 0.456 |
ε^{∗}=0.250 | 0.457 | 0.454 | 0.452 | 0.449 | 0.448 | 0.446 | 0.445 | 0.443 | 0.442 | 0.441 |
ε^{∗}=0.275 | 0.448 | 0.444 | 0.442 | 0.439 | 0.437 | 0.435 | 0.433 | 0.432 | 0.430 | 0.429 |
ε^{∗}=0.300 | 0.443 | 0.438 | 0.435 | 0.432 | 0.429 | 0.426 | 0.424 | 0.422 | 0.420 | 0.419 |
ε^{∗}=0.325 | 0.444 | 0.439 | 0.434 | 0.430 | 0.426 | 0.423 | 0.421 | 0.418 | 0.416 | 0.414 |
ε^{∗}=0.350 | 0.454 | 0.447 | 0.441 | 0.435 | 0.431 | 0.427 | 0.423 | 0.420 | 0.417 | 0.415 |
ε^{∗}=0.375 | 0.476 | 0.467 | 0.459 | 0.452 | 0.446 | 0.441 | 0.436 | 0.432 | 0.428 | 0.424 |
ε^{∗}=0.400 | 0.516 | 0.504 | 0.493 | 0.484 | 0.476 | 0.469 | 0.462 | 0.457 | 0.451 | 0.447 |
ε^{∗}=0.425 | 0.611 | 0.594 | 0.580 | 0.567 | 0.555 | 0.544 | 0.535 | 0.526 | 0.518 | 0.511 |
ε^{∗}=0.450 | 0.748 | 0.731 | 0.715 | 0.700 | 0.687 | 0.674 | 0.662 | 0.650 | 0.640 | 0.630 |
5 Gene expression classification example
Here we demonstrate the application of the previous theory by comparing the bootstrap error estimator that uses the optimal weight with the one that uses the fixed weight w=0.632. We use gene expression data from the well-known breast cancer classification study in [42], which analyzed expression profiles from 295 tumor specimens, divided into N_{0}=115 specimens belonging to the ‘good-prognosis’ population (class 0 here) and N_{1}=180 specimens belonging to the ‘poor-prognosis’ population (class 1).
Bias and RMS of estimators considered in the experiment with expression data from genes ‘OXCT’ and ‘WISP1’
c_{0} | n | ε^{∗} | E[ε_{n}] | Resub bias | Resub RMS | Basic boot bias | Basic boot RMS | Opt boot bias | Opt boot RMS | 0.632 boot bias | 0.632 boot RMS |
---|---|---|---|---|---|---|---|---|---|---|---|
0.33 | 30 | 0.4043 | 0.4206 | −0.0702 | 0.1061 | 0.0008 | 0.0820 | −0.0161 | 0.0803 | −0.0253 | 0.0817 |
0.50 | 30 | 0.3969 | 0.4266 | −0.0719 | 0.1060 | 0.0072 | 0.0830 | −0.0116 | 0.0798 | −0.0219 | 0.0806 |
0.67 | 30 | 0.3893 | 0.4131 | −0.0914 | 0.1185 | −0.0181 | 0.0878 | −0.0355 | 0.0885 | −0.0451 | 0.0909 |
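Because expectation is linear, the bias of any convex combination (1−w)·resubstitution + w·basic bootstrap follows from the two component biases. The sketch below, with hypothetical helper names, applies this to the rounded figures in the table: it recomputes the bias of the 0.632 estimator and backs out the weight implied by the reported ‘Opt boot’ bias. The implied weights come out near 0.76, which is close to the bivariate table entry 0.763 for ε^{∗}=0.400 and n=30; this is an observation from the rounded numbers, not a statement made in the text.

```python
def convex_bias(bias_resub, bias_boot, w):
    """Bias of (1 - w) * resub + w * boot, by linearity of expectation."""
    return (1 - w) * bias_resub + w * bias_boot

def implied_weight(bias_resub, bias_boot, bias_conv):
    """Weight w for which the convex combination has bias `bias_conv`."""
    return (bias_conv - bias_resub) / (bias_boot - bias_resub)

rows = [
    # (c0, resub bias, basic boot bias, opt boot bias, 0.632 boot bias)
    (0.33, -0.0702, 0.0008, -0.0161, -0.0253),
    (0.50, -0.0719, 0.0072, -0.0116, -0.0219),
    (0.67, -0.0914, -0.0181, -0.0355, -0.0451),
]

for c0, b_resub, b_boot, b_opt, b_632 in rows:
    recomputed = convex_bias(b_resub, b_boot, 0.632)
    w_opt = implied_weight(b_resub, b_boot, b_opt)
    print(f"c0={c0:.2f}  0.632 bias: table {b_632:+.4f}, recomputed {recomputed:+.4f}"
          f"  implied opt weight ~ {w_opt:.3f}")
```

The recomputed 0.632 biases agree with the tabulated ones to the four decimals shown.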
6 Conclusions
Exact expressions were derived for the required weight for unbiased convex bootstrap error estimation in the finite sample case, for linear discriminant analysis of Gaussian populations. The results not only provide the practitioner with a recommendation of what weight to use given the sample size and problem difficulty, but also offer insight into the choice of the 0.632 weight for the classic 0.632 bootstrap error estimator. It was observed that the required weight for unbiasedness can deviate significantly from the 0.632 weight, particularly in the multivariate case, where the required weight for unbiasedness appears to settle on an asymptotic value that is strongly dependent on the Bayes error, being as a rule smaller than 0.632. The results were illustrated by application to gene expression data from a well-known breast cancer study.
7 Appendix
Proof of Theorem 1
The result then follows after some algebraic manipulation. By symmetry, to obtain $E[\epsilon_C^{1} \mid C]$, one needs only to interchange all indices 0 and 1. □
Proof of Theorem 2
are independent noncentral chi-squared random variables with d degrees of freedom and noncentrality parameters λ_{5} and λ_{6} defined in (35). The result then follows from (62). Following along the same lines, one can show that $E[\epsilon_C^{1} \mid C]$ is obtained by interchanging s_{0} and s_{1} in the result for $E[\epsilon_C^{0} \mid C]$ (the details are omitted for brevity). □
Declarations
Acknowledgements
The authors acknowledge the support of the National Science Foundation, through NSF awards CCF-0845407 (Braga-Neto) and CCF-0634794 (Dougherty).
Authors’ Affiliations
References
- Efron B: Bootstrap methods: another look at the jackknife. Ann. Stat 1979, 7(1): 1-26. http://projecteuclid.org/euclid.aos/1176344552
- Efron B: Computers and the theory of statistics: thinking the unthinkable. SIAM Rev 1979, 21(4): 460-480. http://www.jstor.org/stable/2030104
- Efron B: Nonparametric standard errors and confidence intervals. Can. J. Stat 1981, 9(2): 139-158. doi:10.2307/3314608
- Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc 1983, 78(382): 316-331. http://dx.doi.org/10.2307/2288636
- Efron B, Gong G: A leisurely look at the bootstrap, the jackknife, and cross-validation. Am. Stat 1983, 37(1): 36-48. http://dx.doi.org/10.2307/2685844
- Efron B, Tibshirani R: An Introduction to the Bootstrap. Chapman & Hall, New York; 1993.
- Efron B, Tibshirani R: Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc 1997, 92(438): 548-560. http://dx.doi.org/10.2307/2965703
- Singh K: On the asymptotic accuracy of Efron’s bootstrap. Ann. Stat 1981, 9: 1187-1195. doi:10.1214/aos/1176345636
- Bickel P, Freedman D: Some asymptotic theory for the bootstrap. Ann. Stat 1981, 9: 1196-1217. doi:10.1214/aos/1176345637
- Beran R: Estimated sampling distributions: the bootstrap and competitors. Ann. Stat 1982, 10(1): 212-225. http://www.jstor.org/stable/2240513
- Hall P: The Bootstrap and Edgeworth Expansion. Springer, New York; 1992.
- Scholz F: The Bootstrap Small Sample Properties. University of Washington, Seattle; 2007.
- Porter P, Rao S, Ku J-Y, Poirot R, Dakins M: Small sample properties of nonparametric bootstrap t confidence intervals. J. Air Waste Manag. Assoc 1997, 47(11): 1197-1203. doi:10.1080/10473289.1997.10464062
- Chan K, Lee S: An exact iterated bootstrap algorithm for small-sample bias reduction. Comput. Stat. Data Anal 2001, 36(1): 1-13. doi:10.1016/S0167-9473(00)00029-3
- Young G: Bootstrap: more than a stab in the dark? With discussion and a rejoinder by the author. Stat. Sci 1994, 9(3): 382-415. doi:10.1214/ss/1177010383
- Shao J, Tu D: The Jackknife and Bootstrap. Springer, New York; 1995.
- Pils D, Tong D, Hager G, Obermayr E, Aust S, Heinze G, Kohl M, Schuster E, Wolf A, Sehouli J, Braicu I, Vergote I, Van Gorp T, Mahner S, Concin N, Speiser P, Zeillinger R: A combined blood based gene expression and plasma protein abundance signature for diagnosis of epithelial ovarian cancer–a study of the OVCAD consortium. BMC Cancer 2013, 13: 178. doi:10.1186/1471-2407-13-178
- Paul S, Maji P: muHEM for identification of differentially expressed miRNAs using hypercuboid equivalence partition matrix. BMC Bioinformatics 2013, 14: 266. doi:10.1186/1471-2105-14-266
- Student S, Fujarewicz K: Stable feature selection and classification algorithms for multiclass microarray data. Biol Direct 2012, 7: 33. doi:10.1186/1745-6150-7-33
- Hwang T, Sun CH, Yun T, Yi GS: FiGS: a filter-based gene selection workbench for microarray data. BMC Bioinformatics 2010, 11: 50. doi:10.1186/1471-2105-11-50
- McLachlan G: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York; 1992.
- Devroye L, Gyorfi L, Lugosi G: A Probabilistic Theory of Pattern Recognition. Springer, New York; 1996.
- Sima C, Dougherty E: Optimal convex error estimators for classification. Pattern Recognit 2006,39(6):1763-1780. 10.1016/j.patcog.2006.03.020
- Chernick M, Murthy V, Nealy C: Application of bootstrap and other resampling techniques: evaluation of classifier performance. Pattern Recognit. Lett 1985,3(3):167-178. [Online] [http://www.sciencedirect.com/science/article/B6V15-48MPVCK-55/2/32754228bc17ac0655b9fa9a7a60ca90]
- Fukunaga K, Hayes R: Estimation of classifier performance. IEEE Trans. Pattern Anal. Mach. Intell 1989,11(10):1087-1101. 10.1109/34.42839
- McLachlan G: Error rate estimation in discriminant analysis: recent advances. Adv. Multivariate Stat. Anal 1987, 233-252.
- Davison A, Hall P: On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika 1992,79(2):279-284. [Online] [http://www.jstor.org/stable/2336839]
- Chernick M: Bootstrap Methods: A Guide for Practitioners and Researchers (Wiley Series in Probability and Statistics), 2nd ed. Wiley-Interscience, Hoboken; 2007.
- Chatterjee S, Chatterjee S: Estimation of misclassification probabilities by bootstrap methods. Comput 1983, 12: 645-656.
- Jain A, Dubes R, Chen C: Bootstrap techniques for error estimation. IEEE Trans. Pattern Anal. Mach. Intell 1987,9(5):628-633. 10.1109/TPAMI.1987.4767957
- Raudys S: On the accuracy of a bootstrap estimate of the classification error. In Proceedings of Ninth International Joint Conference on Pattern Recognition. Rome; 14-17 Nov 1988:1230-1232.
- Braga-Neto U, Dougherty E: Bolstered error estimation. Pattern Recognit 2004,37(6):1267-1281. [Online] [http://www.sciencedirect.com/science/article/B6V14-4BNMG7H-1/2/752fe2e9105d351b8850e48577ba182c]
- Braga-Neto U, Hashimoto R, Dougherty E, Nguyen D, Carroll R: Is cross-validation better than re-substitution for ranking genes? Bioinformatics 2004,20(2):253-258. [Online] [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/2/253]
- Braga-Neto U, Dougherty E: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004,20(3):374-380. [Online] [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/3/374]
- Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 1995, 1137-1145. [Online] [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.529]
- Toussaint G: An efficient method for estimating the probability of misclassification applied to a problem in medical diagnosis. Comput. Biol. Med. 1975, 4: 269. 10.1016/0010-4825(75)90038-4
- McLachlan G: A note on the choice of a weighting function to give an efficient method for estimating the probability of misclassification. Pattern Recognit. 1977,9(2):147-149. 10.1016/0031-3203(77)90012-7
- Raudys S, Jain A: Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans. Pattern Anal. Mach. Intell 1991,13(3):4-37. 10.1109/34.75512
- John S: Errors in discrimination. Ann. Math. Stat 1961,32(4):1125-1144. [Online] [http://www.jstor.org/stable/2237911]
- Moran M: On the expectation of errors of allocation associated with a linear discriminant function. Biometrika 1975,62(1):141-148. [Online] [http://www.jstor.org/stable/2334496]
- Imhof J: Computing the distribution of quadratic forms in normal variables. Biometrika 1961,48(3/4):419-426. 10.2307/2332763
- van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med 2002,347(25):1999-2009. 10.1056/NEJMoa021967
- van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530-536. 10.1038/415530a
- Braga-Neto UM, Zollanvari A, Dougherty ER: Cross-validation under separate sampling: strong bias and how to correct it. Bioinformatics 2014. 10.1093/bioinformatics/btu527
- Anderson T: Classification by multivariate analysis. Psychometrika 1951, 16: 31-50. 10.1007/BF02313425
- Raudys S: Comparison of the estimates of the probability of misclassification. In Proc. 4th Int. Conf. Pattern Recognition. Kyoto, Japan; 1978:280-282.
- Breiman L: Bagging predictors. Mach. Learn. 1996,24(2):123-140.
- Vu T, Braga-Neto U: Is bagging effective in the classification of small-sample genomic and proteomic data? EURASIP J. Bioinformatics Syst. Biol 2009, 2009: Article ID 158368. 10.1155/2009/158368
- Vapnik V: Statistical Learning Theory. Wiley, New York; 1998.
- Nijenhuis A, Wilf H: Combinatorial Algorithms, 2nd ed. Academic Press, New York; 1978.
- Hills M: Allocation rules and their error rates. J. R. Stat. Soc. Series B (Methodological) 1966,28(1):1-31. [Online] [http://www.jstor.org/stable/2984268]
- Zollanvari A, Braga-Neto U, Dougherty E: On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recognit 2009,42(11):2705-2723. 10.1016/j.patcog.2009.05.003
- Price R: Some non-central F-distributions expressed in closed form. Biometrika 1964, 51: 107-122. 10.1093/biomet/51.1-2.107
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.