Skip to main content

Identification of CpG islands in DNA sequences using statistically optimal null filters

Abstract

CpG dinucleotide clusters also referred to as CpG islands (CGIs) are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation, as such, they are normally used as gene markers. The earlier CGI identification methods used the rich CpG dinucleotide content in CGIs, as a characteristic measure to identify the locations of CGIs. The fact, that the probability of nucleotide G following nucleotide C in a CGI is greater as compared to a non-CGI, is employed by some of the recent methods. These methods use the difference in transition probabilities between subsequent nucleotides to distinguish between a CGI from a non-CGI. These transition probabilities vary with the data being analyzed and several of them have been reported in the literature sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic to reliably and efficiently identify CGIs in a given DNA sequence which is devoid of any ambiguities. Our proposed scheme combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22, and proved to be highly reliable as well as efficient in identifying the CGIs.

Introduction

In the recent years, computational methods for processing and interpreting vast amount of genomic data, generated from genome sequencing, have gained a lot of scientific interest. Genomic sequences such as deoxyribonucleic acid (DNA) consist of biological instructions which are crucial for the development and normal functioning of almost all living organisms[1]. A DNA molecule has a complex double helix structure that involves two strands, consisting of alternating sugar and phosphate groups. Attached to these sugar groups of each DNA strand are one of the four chemical bases, namely, adenine (A), thymine (T), guanine (G), and cytosine (C). A unit comprising of base, sugar, and phosphate is referred to as a nucleotide. Hydrogen bonds between the nucleotides A and T (similarly between nucleotides G and C) from the opposite strands not only stabilize the DNA molecule, but also make the two strands complimentary. Nucleotides in a DNA strand exhibit short, recurring patterns (also called sequence motifs) that are presumed to have a biological function. Identification of these patterns helps in understanding the biological information hidden in a DNA sequence. A human DNA consists of about 3 billion nucleotides and completion of genome sequencing of numerous model organisms has further proliferated genomic databases. To completely decipher, the biological information in a DNA sequence is a daunting task and development of fast, efficient, and cost effective computational techniques for the same is a big challenge.

A sequence pattern that plays a crucial role in the analysis of genomes is CpG Island (CGI). A typical CGI consists high-frequency of CpG dinucleoetides, where ‘p’ refers to the phosphodiester bond between the adjacent nucleotides[1, 2]. This bond is different from the hydrogen bond that exists between C and G across two strands in a DNA double helix. The length of a CGI varies from a few hundred to a few thousand base pairs (bp), but rarely exceeds 5000 bp. It is known that CpG Islands (CGIs) occur in and around the promoter regions of (50–60)% of human genes, including most housekeeping genes (the genes which are essential for general cell functions)[3]. Gene is a stretch of DNA sequence which has biological information for the synthesis of a protein. The promoter region in a gene regulates its functionality[47]. Due to the association of CGIs with promoters, CGIs play an important role in promoter prediction and consequently in the prediction of genes[8, 9]. CGIs also contribute significantly in discovering the epigenetic causes of cancer. CGIs located in the promoter regions of certain tumor suppressor genes are normally unmethylated in healthy cells. DNA methylation is a biochemical modification resulting from addition of a methyl group to cytosine nucleotide (C). In cancer cells, CGIs usually undergo a dense hypermethylation leading to gene silencing as shown in Figure 1. Owing to this, they can be used as candidate regions for aberrant DNA methylation, for early detection of cancer[1014]. For these reasons, identification of CGIs has become indispensable for genome analysis and annotation.

Figure 1
figure 1

Difference between mythelated and unmythelated CpG Island.

Despite their accuracy, experimental methods employed by biologists for identification of CGIs are extremely time-consuming, simply because of the enormity of genomic data. On the other hand, computational methods can be much more attractive for the identification of possible CGIs. The results obtained from computational methods can be used by biologists to validate and further enhance the accuracy of identified CGI locations. There are several computational methods[1526] reported in the literature for identification of CGIs in DNA sequences. In one of the first computational attempts[15], a CGI is defined as a DNA segment fulfilling the following three conditions: (i) length of segment is at least 200 bp, (ii) G and C contents are ≥ 50%, and (iii) observed CpG to expected CpG ratio (o/e) is ≥ 0.6. Observed CpG is the number of CpG dinucleoetides in a segment and expected CpG is calculated by multiplying the number of ‘C’s and the number of ‘G’s in a segment and then dividing the product by length of the segment. This method however falsely identifies the other G and C rich motifs, e.g., Alu repeats, as CGIs. In subsequent methods, these three conditions were made more stringent in order to reduce false identification at the expense of missing some true CGIs[24]. Sophisticated methods utilizing two Markov chain models[27, 28], one for CGIs and the other for non-CGIs, are proposed[2, 25, 26]. These two Markov models differ in their respective model parameters which characterize the difference in transition probabilities between successive nucleotides in CGIs and non-CGIs, respectively. In these methods, a DNA segment is defined as CGI, if the log-score[2] computed using Markov model for a CGI is greater than that computed using Markov model for a non-CGI. Consequently, the model parameters used for CGIs and non-CGIs play a crucial role in identifying the CGIs. However, different methods employing such models from time-to-time produce inconsistent results. Another criterion based on the physical distance distribution of CpG dinucleoetides in a DNA segment has also been proposed[23]. Methods based on this criterion are dependent on nucleotide composition of a DNA sequence being analyzed and suffer from low identification specificity.

Recently, digital signal processing (DSP)-based algorithms have gained popularity for the analysis of genomic sequences since they can be mapped to numerical sequences. Digital filters have successfully been employed for identification of protein coding regions (exons) in DNA sequences and hot-spots in protein sequences[2933]. Digital filters have also been used for identification of CGIs with considerable success[25, 26]. These methods are similar to Markov chain methods but use digital filters to compute weighted log-score to identify CGIs. The method proposed in[25] employs a bank of IIR low-pass filters (about 40 filters, each with different bandwidth) to identify the CGIs by looking at the weighted log-scores of all the filters together. The CGI identification sensitivity of this method is affected by the tradeoff between responsiveness of filter and stability of the output. Moreover, this method may become computationally demanding as it makes use of a large number of filters in the bank. Another DSP based algorithm in[26] employs an underlying multinomial statistical model[34] to estimate its Markov chain parameters followed by an FIR filter with Blackman window to compute the weighted log-score.

It is evident from above discussion that the CGI identification methods and more importantly the criteria used therein play a crucial role in identifying CGIs. As such, development of fast and efficient computational methods with highly reliable CGI identification criteria is a necessity. Statistically optimal null filters (SONF) have been proven for their ability to efficiently estimate short-duration signals embedded in noise[35]. In this article, we propose a new DSP algorithm for identification of CGIs using SONF which combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the message signal, characterizing the CGI, embedded in noise. Normally, the CGI identification accuracy is a lot dependent on the Markov models used and sometimes produces contrasting results. Also, one of the main objectives of the article is to find a uniform yet effective alternative CGI identification measure replacing the current measure based on transition probabilities. In the proposed scheme, we have formulated a simple basis function to be used in SONF which characterizes the CGI. Our criterion is devoid of any ambiguities associated with the choice of transition probabilities used in some of the algorithms. The proposed scheme is tested on a large number of already annotated DNA sequences obtained from human chromosomes 21 and 22. It is shown that our scheme is simple to implement and yet able to identify CGIs reliably and efficiently.

The rest of the article is organized as follows: the following section briefly describes a few existing DSP-based algorithms for the identification of CGIs. In Section “Proposed scheme”, the proposed SONF-based scheme for identifying CGIs in DNA sequences is explained. Results obtained from the proposed scheme are depicted as well as tabulated in Section “Results and discussion”. Finally, “Conclusion” section concludes the article describing some of the significant features of the proposed scheme.

Related study

In this section, we give a brief review of some of the existing CGI identification methods as a preparatory groundwork for the method to be proposed in Section “Proposed scheme”.

Markov chain approach

In this method, a DNA sequence of length N, represented as X = {x(n),x(n + 1),…,x(n + N−1)} where each symbol x(n){A C T G} is considered as a first-order Markov chain[27] due to its conditional independence property, i.e., the nucleotide occurring at the location (n−1) does not offer any information over and above that at n to predict the nucleotide occurring at (n + 1). In a CpG island, the probability of transition from the nucleotide base C to the base G is higher in comparison with that in a non-CGI. Let the probability of transition from a nucleotide β to a nucleotide γ in a CGI and a non-CGI be denoted as p β γ + and p β γ respectively. Tables1 and2 taken from[2] show the transition probabilities for CGI and non-CGI Markov models. These tables are derived from 48 putative CGIs in human DNA sequences. Each row in the tables contains transition probabilities from a specific nucleotide base to each of the four bases. These transition probabilities p β γ ± are calculated using

p β γ ± = n β γ ± k { A , T , G , C } n β k ±
(1)
Table 1 Transition probabilities inside a CGI
Table 2 Transition probabilities inside a non-CGI

where n β γ ± is the number of dinucleoetides βγ in a DNA sequence. Naturally, every row in the tables adds up to unity. As expected, in Table1, which corresponds to the CGI Markov model, the probability that a C is followed by a G is very high as compared with that in Table2.

The CGIs in the DNA sequence X are identified by analyzing the windowed sequence X n = {x(n),x(n + 1),…,x(n + L−1)} of length L, and those obtained by shifting the window by one position at a time. The probability of observing a windowed sequence assuming that it belongs to a CGI is given by

P ( X n | CGI ) = P ( x ( n ) x ( n + L 1 ) | x ( n 1 ) , CGI model ) = i = 0 L 1 p x ( n 1 + i ) x ( n + i ) +
(2)

Similarly, the probability of observing this sequence assuming it belongs to a non-CpG island region is

P ( X n | non - CGI ) = P ( x ( n ) x ( n + L 1 ) | x ( n 1 ) , non - CGI ) = i = 0 L 1 p x ( n 1 + i ) x ( n + i ) +
(3)

If the value of P(X n |CGI) > P(X n |non-CGI), then, it is concluded that the DNA sequence X n belongs to a CGI. Otherwise, it is more likely to be a non-CGI island. Alternatively, by formulating a log-likelihood ratio, given by

S(n)= 1 L log P ( X n | CGI ) P ( X n | non - CGI )
(4)

If S(n) > 0, the given DNA sequence is more likely to belong to a CGI, and if S(n) < 0 the sequence probably belongs to a non-CGI region.

IIR low-pass filter approach

Yoon and Vaidyanathan[25] have noted that the log-likelihood ratio given in (4) can be expressed as:

S ( n ) = 1 L log n = 0 L 1 p x ( n 1 ) x ( n ) + p x ( n 1 ) x ( n ) = 1 L i = 0 L 1 y ( n + i ) = y ( n ) h ave ( n )
(5)

where y(n) is a sequence representing the log-likelihood ratio of a single transition given by

y(n)=log p x ( n 1 ) x ( n ) + p x ( n 1 ) x ( n )
(6)

and, h ave (n) is a simple averaging filter defined as

h ave (n)= 1 / L , for L + 1 n 0 0 , otherwise .
(7)

Then, they proposed using a bank of M filters each having different bandwidth, instead of using simply one low-pass filter have(n). Specifically, the filter used in the k th (k = 0,…,M−1) channel has a transfer function given by

H k (z)= 1 α k 1 α k z 1
(8)

where 0 < α0 < α1 < < αM−1 < 1. Since impulse response of a filter in the bank is h ave k = 1 α k α k k u n more recent inputs are given larger weights than the past ones in the averaging process of y(n). The filter bank consists of 40 channels (M = 40), and the filter parameter α k is chosen from 0.95 to 0.99 with an increment of 0.001. The log-likelihood ratio obtained from the output of the k th channel is given by

S k (n)=y(n) h k (n)
(9)

The values of S k (n) obtained for all k and n are then used to obtain a two-level contour plot. The bands corresponding to S k (n) > 0 determine the locations of CGIs.

In this method, the use of filter bank increases the computational overhead considerably. For fair comparison, instead of a bank on M filters, we have used one pole filter with optimized parameter α = 0.99 to compare with other methods (this reduces the number of computations considerably).

Multinomial statistical model

This method by Rushdi and Tuqan[26] differs from the previous method by the way the transition tables are obtained and the type of digital filter used to calculate the log-likelihood ratio. Instead of using (1) to obtain the transition probability tables, they are generated by comparing the frequency of each dinucleotide with the one expected under a multinomial model[34]. Transition probabilities p β γ ± for the windowed sequence X n are calculated using

p β γ ± = c β γ ± k { A , T , G , C } c β k ±
(10)

where

c β γ ± = frequency β γ ± ( n ) ( frequency β ± ( n ) ) ( frequency γ ± ( n ) )
(11)

This method uses a FIR digital filter with variable coefficients generated by Blackman window to calculate the log-likelihood ratio S(n) given in (4). The locations with S(n) greater than zero are the probable locations of CGIs.

All of the above-mentioned methods rely on the transition probability tables to calculate log-likelihood ratio used to identify CGIs. The methods[25, 26] specifically vary by the way y(n), obtained from the respective transition tables, are averaged. It is shown later in Section “Results and discussion” that the choice of the transition tables may produces contrasting results. Hence, a more reliable and efficient scheme that is devoid of these transition tables is necessary for identifying CGIs.

Proposed scheme

In this study, we adopt the SONF approach, proposed in[35], to efficiently identify CGIs in DNA sequences. SONF is used for estimation of short duration signal, S n = {s(m)}, embedded in noise R n = {r(m)} by combining maximum signal-to-noise ratio and least squares optimization criteria. The implementation of the twofold optimization in SONF is shown in Figure 2, where an instantaneous matched filter (IMF) is first used to detect the presence of a short duration signal embedded in noise by maximizing the signal-to-noise ratio over variable-time observation interval m. The IMF output, I n , is then scaled by a locally generated function n , by least squares (LS) optimization procedure, to obtain the signal Y n , an estimate of S n . It has been shown that the SONF is equivalent to a Kalman filter with a much simpler implementation[35]. Also, SONF has the ability to track rapidly changing signals leading to more practical processing of short-duration signals[36, 37]. Therefore, the proposed scheme is expected to perform better in situations even if the CGIs are of very short length of the order of 200 bp.

Figure 2
figure 2

Statistically optimal null filter.

To be able to apply SONF approach to identify CGIs, the DNA sequence X, of length N, is first mapped to an appropriate binary numerical sequence X CG = {x CG (n)}. A sliding window of length L is used to evaluate if each of the windowed sequences, X n = {x CG (m)}, where n = 1,2,…,NL + 1 and m = n,n + 1,…,n + L−1, belong to a CGI or not. Each of the windowed sequence X n can be expressed as

X n = S n + R n
(12)

where S n = {s(m)} is a message signal corresponding to a CGI and R n = {r(m)} is a residual signal. S n and R n are each of length L. Let Φ = {ϕ(m)} be a fixed binary basis sequence of length L having some characteristic property of CGI.

Now, the message signal corresponding to a CGI can be expressed as S n = V n Φ, where V n = {v(m)} and Φ are sequences each of length L. The sequence V n Φ is obtained by multiplying the corresponding elements of V n and Φ. The sequence V n is determined by minimizing R n in least square sense. Let the message signal be denoted as S n = {s(m)}. The objective of the proposed method is to choose the basis sequence such that V n resulting from the optimization process has some discriminating feature of indicating whether the associated sequence X n belongs to a CGI. The following subsections explain in detail the steps involved in identification of CGIs in a DNA sequence using SONF.

Numerical mapping of DNA sequences

As DNA sequences are alphabetical in nature, they need to be mapped to numerical sequences in order to employ the DSP techniques for DNA sequence analysis. There are several mapping techniques reported in the literature. One of the earliest and a popular mapping is that of Voss’s binary indicator sequences[38]. A DNA sequence X can be mapped to a set of four digital signals by forming four binary indicator sequences, namely, X A , X T , X G , and X C . In each of these binary indicator sequences, ’1’ represents the presence and ’0’ absence of the corresponding bases A, T, G, and C in X. For instance, considering a DNA sequence X = {ATCCGAAGTATAACGAA}, the binary indicator sequence corresponding to G, i.e., X G can be expressed as X G = {00001001000000100}. Indicator sequences for the remaining three nucleotides can be represented in a similar fashion.

The problem of CGI identification deals with G and C content in a DNA sequence. Hence, we define a new indicator sequence X CG = {x CG (n)}, which indicates the presence of the nucleotides C and G in the DNA sequence. For example, the binary indicator sequence X CG of the DNA sequence above is X CG = {00111001000001100}.

Choosing the basis sequence

In this study, we have noticed that the dinucleotides CC, CG, GC, and GG occur more frequently in a CGI as compared to a non-CGI. For this study, we have calculated the occurrence of these four dinucleotides in the sequence L44140 taken from the chromosome X of Homo sapiens. The sequence L44140 is of length 219447 bp and has 17 CGIs whose locations are obtained from[39]. Figure 3 depicts the relative occurrence of the above four dinucleotides as compared to the remaining dinucleotides (AA, AC, AG, AT, CA, CT, GA, GT, TC, TG, TT, and TA) in CGIs and non-CGIs of L44140. Here, the relative occurrence of a particular dinucleotide is equal to the number of times that dinucleotide occurs in the sequence divided by the sequence length. It is evident that the dinucleotides CC, CG, GC, and GG occur more frequently in CGIs whereas the other dinucleotides occur more frequently in non-CGIs. This observation can also be inferred from the transition probability tables (Tables1 and2) as the values of p β γ + are greater than p β γ , where β and γ are either G or C. In Figure 3, the darker bars corresponding to the dinucleotides CC, CG, GC, and GG are taller in CGIs, whereas the darker bars corresponding to the other dinucleotides are shorter. Hence, instead of just considering the difference in relative occurrence of CG, it is more productive to consider the relative occurrence of the dinucleotides CC, CG, GC, and GG to distinguish between a CGI and a non-CGI.

Figure 3
figure 3

Comparison of relative occurrence of dinucleotides in CGIs and non-CGIs of L44140.

Moreover, we have studied the difference in gap sizes between the dinucleotides CC, CG, GC, and GG in CGIs and non-CGIs of L44140. The shortest possible gap is of size 0 when the dinucleotides are adjacent to each other. Figure 4 shows the relative occurrence of gaps of various sizes in a CGI and a non-CGI. Here, relative occurrence of a particular gap size is equal to the number of times that gap size occurs in the sequence divided by the sequence length. Obviously, the gap of size 0 occurs more frequently in a CGI as compared to that of a non-CGI. And, it is found that the gap size in a non-CGI can go up to 40 where as in CGIs the maximum gap size was found to be 19. It can also be seen that the gaps of sizes 0, 1, and 2 occur more frequently in a CGI and the gap sizes of 3 and greater occur more frequently in a non-CGI. A gap of size 2 is the largest gap which can distinguish between a CGI and a non-CGI.

Figure 4
figure 4

Relative occurrence of various gap sizes in CGIs and non-CGIs of L44140.

Based on the above observations, the basis sequence which characterizes a CGI can be formulated as Φ = {1100110011…001100}. The 1’s in Φ represent either the nucleotide C or G. The 1’s always appear in pairs where each pair representing one of the dinucleotide CC, CG, GC, or GG. The 0’s in Φ form the gap between the dinucleotides. A gap size of 2 is chosen between the dinucleotides. This choice of Φ is also satisfies the basic criteria of a CGI, i.e., at least 50% of the nucleotide content in a CGI is due to C and G.

Now, in order to obtain the length of Φ (window size), we have analyzed CGIs and non-CGIs of different lengths for the relative occurrence of various gap sizes. Figure 5 shows the plot of Δ versus window size for various gap sizes. Here, Δ is the difference of relative occurrence of a particular gap in a CGI and a non-CGI for a fixed window length. It can be seen that Δ is maximum for gap size 0. As the window size increases Δ also increases before it reaches a steady value. Δ is negative for gap sizes of 3 and greater signifying that the gap sizes of 3 and higher are more probable in non-CGIs compared to CGIs. For the gap size 2, Δ stabilizes for window sizes greater than 200. Larger the window size, larger the number of computations, and hence in the proposed method we have used the length of Φ (window size) to be equal to 200.

Figure 5
figure 5

Difference of relative occurrence of a particular gap in a CGI and a non-CGI for different window lengths.

IMF

The objective of IMF, which is the first stage of SONF shown in Figure 2, is to detect the presence of the waveform Φ in the input sequence X n . IMF is an improvement over a matched filter, the difference being, in IMF optimal SNR is repeatedly calculated at every sample m, over an observation interval m[n,n + L−1]. IMF takes X n and Φ as inputs and produces an output sequence I n = {ι(m)} where

ι(m)= i = n m x CG (i)ϕ(i)
(13)

for m = n,n + 1,…,n + L−1. It can be seen that at each sample m, ι(m) is calculated over a varying interval i[n,m]. Note that, assuming ι(n−1) = 0, ι(m) can also be calculated using the recursive relation given by

ι(m)=ι(m1)+ x CG (m)ϕ(m).
(14)

The output ι(m) leads to an optimal detection of Φ at each sample m, and can be expressed as

ι(m)=v(m)c(m)+ r 0 (m)
(15)

where r 0 (m) is the residual signal in IMF output, and c(m) is given by

c(m)= i = n m ϕ 2 (i).
(16)

The v(m)V n in (15) is an unknown gain.

Least square optimization of the IMF output

The objective of the second stage in SONF is to determine a sequence = {λ(m)}, which when used to scale the IMF output I n , produces the SONF output, Y n , such that Y n V n Φ. Here, Y n is an element wise product of V n and Φ. Y n is an estimate of S n , which is the message signal corresponding to CGI.

Let us consider the suboptimal case in which a sample of the IMF output ι(m) in (15), when scaled by λ(m)=ϕ(m)/c(m), generates

y ( m ) = v ( m ) c ( m ) + r 0 ϕ ( m ) c ( m ) = v ( m ) ϕ ( m ) + r 0 ( m ) = s ( m ) + r 0 ( m )
(17)

where y(m) is an element of the SONF output, Y n . As we desire optimal null filtering, i.e., y(m) = s(m), the residual element, r0(m), needs to be entirely eliminated.

Before determining the optimal n , corresponding to ideal null filtering, we define the sequence Z n = {z(m)} such that,

z ( m ) = x CG ( m ) y ( m ) = s ( m ) + r ( m ) λ ( m ) ι ( m )
(18)

Ideally, y(m) = s(m) and from (18), zideal(m) = r(m). Now, the optimal n ={λopt(m)} is determined by minimizing the mean square error,E[ e λ 2 (m)], with respect to λ(m) where

e λ (m)= z ideal (m)z(m).
(19)

The optimal post IMF scaling sequence λopt(m) obtained by carrying out the above mean square minimization is[35]

λ opt (m)= ϕ ( m ) c ( m ) + 1 / SNR
(20)

where SNR is the input signal-to-noise ratio (considering r(m) to be noise).

In order to implement SONF, the value of the input SNR should be known. To circumvent this problem, a suboptimal case, as shown in (17), is assumed considering c(m) >> 1/SNR, leading to

λ subopt (m) ϕ ( m ) c ( m )
(21)

It can be shown that as m increases, λ subopt (m) λ opt (m) because the second term in the equation

λ subopt ( m ) λ opt ( m ) =1+ 1 ( SNR ) c ( m )
(22)

approaches zero (as the value of c(m) progressively increases with m). So, the value of initial SNR in (20) will influence only the starting few samples in Y n .

The SONF can easily be implemented recursively using the following equations[35]

ι ( m ) = ι ( m 1 ) + x CG ( m ) ϕ ( m ) P ( m ) = P ( m 1 ) P ( m 1 ) ϕ ( m ) ϕ ( m ) P ( m 1 ) 1 + ϕ ( m ) P ( m 1 ) ϕ ( m ) λ ( m ) = P ( m ) ϕ ( m ) y ( m ) = ι ( m ) λ ( m )
(23)

In this case of DNA analysis, one may choose the initial value of the gain P(0) to be 1 and ι(0) = ι(1).

The proposed SONF-based CGI identification algorithm for a DNA sequence of length N can now be summarized as follows:

Initialization: Set the base location index n = 0.

  • Step 1: Apply a rectangular window of length L = 200 starting at the base location n of the DNA sequence X to obtain the windowed sequence X n .

  • Step 2: Obtain the binary indicator sequence X CG for the windowed sequence, X n , from Step 1.

  • Step 3: X CG from Step 2, along with the binary basis sequence Φ, form the inputs to SONF. The corresponding SONF output sequence, Y n , is evaluated using the recursive relations given in (23), by assuming P(0) = 1 and ι(0) = ι(1).

  • Step 4: Compute the SNR power gain G(X n ), which is the ratio of the variance of the SONF output, Y n , to the variance of the corresponding input X n .

  • Step 5: Increment the value of n by 1, i.e., n = n + 1. If n ≤ (NL) go to Step 1, else go to Step 7.

  • Step 6: Plot G(X n ) as a function of n + L and get its upper envelope. The peaks in the resulting plot which are above the threshold, η, indicate the locations of CGIs identified in X.

  • Step 7: Exit the algorithm.

Figure 6 shows the SONF implementation for better understanding of the proposed approach. Figure 6a,b shows an example of a CGI and a non-CGI with 80 bp. Naturally, in Figure 6a there are greater number of ones. Figure 6c,d shows the IMF output for a CGI and a non-CGI, respectively. It can be seen that the IMF output corresponding to a CGI progressively increases to a greater value of 35 as compared to 6 of that of a non-CGI. Figure 6e,f is the scaling functions for a CGI and a non-CGI, respectively. They are obtained using the relation λ(m) = P(m)ϕ(m) in (23). Finally, Figure 6g,h shows the estimated CGI characteristic in a CGI and a non-CGI, respectively. The SONF output corresponding to a CGI has greater amplitude as compared with that of a non-CGI.

Figure 6
figure 6

SONF implementation: (a) An example of a CGI; (b) An example of a non-CGI; (c) IMF output for CGI; (d) IMF output for non-CGI; (e) Scaling function for CGI; (f) Scaling function for non-CGI; (g) SONF output for CGI; and, (h) SONF output for non-CGI.

Prediction measures

The identification of CGIs can have four possible outcomes; true positive (TP), true negative (TN), false positive (FP), or false negative (FN) as shown in Figure 7. Two basic measures of determining the accuracy of prediction are sensitivity (Sn) and specificity (Sp)[40]. Sensitivity, given by

Sn= TP TP + FN
(24)
Figure 7
figure 7

Four possible outcomes of CGI prediction.

and is defined as the proportion of CGIs that have been predicted correctly. Whereas, specificity given by

Sp= TP TP + FP
(25)

is defined as the proportion of the predicted CGIs that are real. Sensitivity and specificity can take on values from 0 to 1. For a perfect prediction, Sn = 1 and Sp = 1. Neither sensitivity nor specificity alone can provide a good measure of the global accuracy, because high sensitivity can be achieved with little specificity and vice versa. A measure that combines sensitivity and specificity values is called the correlation coefficient (CC) and is given by

CC= ( TP × TN ) ( FN × FP ) ( TP + FN ) ( TN + FP ) ( TP + FP ) ( TN + FN )
(26)

The value of CC ranges from −1 to 1, where a value of 1 corresponds to a perfect prediction; a value of −1 indicates that every CGI has been predicted as non-CGI, and vice versa.

Another measure, called the performance accuracy (Acc), used in our analysis is given by

Acc= TP + TN TP + FP + TN + FN
(27)

In this article, we have evaluated the performance of different CGI identification methods at the nucleotide level. For example, the value of TP is obtained by adding all the nucleotides predicted to to true positive, and the other outcomes are calculated in the similar manner. At the CGI level, even if one nucleotide (or a threshold of a minimum number of nucleotides) corresponding to a CGI is predicted to be true positive the entire CGI is assumed to be predicted correctly.

Results and discussion

The proposed CGI prediction scheme is tested on several genomic sequences of varying lengths taken from the human chromosomes 21 and 22. More precisely, we have used the three contigs, NT_113952.1, NT_113954.1, and NT_113958.2 from chromosome 21, and the contig NT_028395.3 from chromosome 22 for our analysis. All the sequence data considered for this study are obtained from the GenBank Database[39]. The performance of the proposed scheme is compared with the other popular DSP-based approaches such as Markov chain[2], IIR low-pass filters[25], and multinomial model[26].

First, a DNA sequence from human chromosome X with the GenBank accession number of L44140 is analyzed for illustrative purpose. The sequence is of length 219447 bp and is already annotated, i.e., the locations of its CGIs are already known and can be obtained from[39]. The sequence L44140 is also used to obtain the values of threshold, η, used by the DSP-based methods being compared in this article.

Figure 8 shows the comparative performance of CGI prediction by the above-mentioned four approaches. Figure 8a shows the performance of Markov chain approach, where log-likelihood ratio S(n) is plotted against base index of the sequence. The transition probability tables given in Tables1 and2 are used to calculate S(n). All the base locations, n, with S(n) > 0 imply that they are very likely to be a part of a CGI. A window length of 200 bp is considered for the method. Markov chain method is able to detect most of the CGIs in the DNA sequence and it can be seen that the CGIs and non-CGIs can reasonably be differentiated by looking at the sign of S(n). However, one of the major drawbacks of this method is the presence of a lot of false positives that falsely categorize non-CGIs into CGIs.

Figure 8
figure 8

CGI prediction in the DNA sequence L44140 using (a) Markov chain method; (b) IIR Filter method; (c) Multinomial model; (d) SONF scheme.

Figure 8b shows the performance of IIR low-pass filter approach where the log-likelihood ratio, S(n), is plotted against base index of the sequence, n. The transition probability tables given in[25] are used to calculate S(n). For fair comparison, instead of a bank on M filters, we have used one pole filter with optimized parameter α = 0.99 for this method. All the base locations, n, with S(n) > 0 imply that they are very likely to be a part of a CGI. A window length of 200 bp is considered for the method. Similar to the Markov chain method, this method also produces a lot of false positives affecting the prediction accuracy.

Figure 8c shows the prediction of CGIs using the multinomial model in[26]. An underlying multinomial statistical model is employed to estimate the Markov chain model parameters that result in the transition probability tables given in[26]. A Blackman window of length 100 bp is employed for calculating the filtered log-likelihood ratio. The Blackman window gives larger weights for central samples of the window, thus reducing the edge effects. Windows with the positive filtered log-likelihood ratio are considered to be a part of a CGI. This method shows considerably high false positives making the CGI prediction unreliable.

Figure 8d shows performance of the proposed SONF scheme in predicting the CGIs. Unlike the above-mentioned methods, our scheme utilizes the binary basis sequence, Φ, instead of the probability transition tables. The proposed scheme first maximizes SNR of the output at each time instant using IMF, then it further enhances the estimated signal using least-square optimization criterion, to estimate the presence of Φ in the input windowed DNA sequence. A window size of 200 is used for the proposed method. Effectiveness of the proposed scheme is clearly visible in Figure 8d, which depict more contrasting peaks as compared to the other three approaches. These contrasting peaks make the identification process comparatively easier resulting in less number of false positives.

It can be seen from Figure 8 that the default threshold on η = 0 produces a lot of false positives for the methods using transition probability tables. The optimal threshold values for the methods is obtained by calculating the prediction Acc for varying thresholds for each method (Figure 9). The optimal values of thresholds obtained for the Markov chain method, IIR filter method, and the proposed SONF approach are 0.1, 0.05, and 0.6, respectively. The actual locations of the CGIs, obtained from NCBI website, present in the sequence L44140 are represented by red horizontal spots in Figure 8. Figure 10 is receiver operating characteristic (ROC) curves plotted for the four methods. It can be seen that the proposed approach has better overall performance for the sequence L44140 with the area under the curve 0.7460. The Markov chain method is next with the area under the ROC curve 0.6072. The area under the curve for IIR filter method is 0.3106. It can be seen that the multinomial model method has the least area under the ROC curve. The dismal performance of the multinomial model does not indicate anything about the method in itself but merely implies that the transition probability tables used may not be appropriate for the example considered.

Figure 9
figure 9

Relation between the performance Acc and threshold.

Figure 10
figure 10

ROC curves obtained for the sequence L44140.

We have evaluated the time complexity of the proposed method using the tic-toc function in MATLAB. Taking the necessary precautions (such as all applications except MATLAB were closed, a fresh session of MATLAB was started for each task, and MATLAB was warmed up with the code, i.e., the first run of the code was ignored), the CPU time for processing a fixed length of sequence, the Markov chain method was found to be the least followed by SONF, IIR and multinomial approaches with an additional CPU time of 1.29%, 1.78%, and 1.82%, respectively. This difference is not substantial considering today’s computing resources.

Figure 11 shows the performance of the four methods for the prediction of CGIs in the first 15000 bps of L44140. The red horizontal lines are the actual locations of CGIs. The blue binary decision curve depicts the locations of the predicted CGI by the methods. As can be seen from Figure 11c, the multinomial-based approach fails to detect the CGI located between base pairs 3095 and 3426 as opposed to other three methods implying that the probability transition parameters used for the CGI identification play a crucial role. Hence, it is important to have a CGI identification characteristic which is devoid of any ambiguity with the choice of different probability transition tables available. The binary basis sequence Φ in the proposed scheme successfully identifies the CGIs and can be reliably used as CPG identification characteristic.

Figure 11
figure 11

CGI prediction in the first 15000 bps of L44140 using (a) Markov chain method; (b) IIR filter method; (c) multinomial model; (d) SONF scheme. Binary decision based on respective threshold is plotted against the base location index

Table3 presents the summary of performance measures Sn, Sp, CC, and Acc obtained for the analysis of four contigs NT_113952.1, NT_113954.1, NT_113958.2, and NT_028395.3. The performance of the proposed scheme is also compared with that of CpGCluster[23], which uses the distance between CpG dinucleotides (and not the transition probability tables) for identifying CGIs. The proposed approach has the highest values of Sn for all the contigs (shown in bold) and has the highest values of CC for the contigs NT_113954.1 and NT_113958.2. The performance accuracy is also quiet high, consistently above 97% which is a good sign. This shows that the proposed method is reliable and the proposed binary basis sequence Φ is an alternative CGI identification characteristic. The multinomial method did not identify any of the CGIs in the contig NT_028395.3 and hence its Sn and Sp values are zero. The corresponding Acc value is high because the method predicting most of the true negatives correctly. The contig NT_028395.3 has short CGIs of the order of 200 bps and the proposed approach with better sensitivity is capable of identifying them.

Table 3 Comparison of different methods for identification of CGIs

Conclusion

In this article, a new DSP-based technique using SONFs is proposed for the prediction of CGIs in DNA sequences. A novel CPG identification characteristic is presented in the form of a binary basis sequence which is shown to identify CGIs reliably. It has also been shown that the performance of the existing methods which use discriminating transition probability tables for CGIs/non-CGIs is not consistent. The prediction accuracy of these methods are highly dependent on the training data used to obtain the transition probabilities of CGIs and non-CGIs. The inability of finding a unique CGI identification characteristic has resulted in failure in predicting many of the CGIs. This article makes an attempt to present a unique CGI identification characteristic which does not require any training. Furthermore, the ability of SONF to track short duration signals is exploited in identifying the CGIs in DNA sequences. SONF combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The performance of the proposed technique is tested on four randomly chosen contigs in chromosomes 21 and 22 of human beings. The simulation results comparing the performance of the proposed technique with the other three DSP-based CGI prediction techniques have shown that the proposed approach enjoys superior prediction accuracy in terms of sensitivity. The overall predicting accuracy of the proposed approach is also consistently above 97% and is comparable to that of the Markov chain method making it a reliable method.

References

  1. Lodish H, Berk A, Zipursky S, Matsudaira P, Baltimore D, Darnell J: Molecular Cell biology. Scientific American, New York,; 1995.

    Google Scholar 

  2. Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University Press, Cambridge,; 1998.

    Book  MATH  Google Scholar 

  3. Antequera F, Bird A: Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA 1993, 90(24):11995-11999. 10.1073/pnas.90.24.11995

    Article  Google Scholar 

  4. Antequera F, Bird A: CpG islands as genomic footprints of promoters that are associated with replication origins. Curr. Biol 1999, 9: 661-667. 10.1016/S0960-9822(99)80290-5

    Article  Google Scholar 

  5. Ioshikhes I, Zhang M: Large-scale human promoter mapping using CpG islands. Nat. Genet 2000, 26: 61-63. 10.1038/79189

    Article  Google Scholar 

  6. Antequera F: Structure, function, evolution of CpG island promoters. Cell. Mol. Life Sci 2003, 60(8):1647-1658. 10.1007/s00018-003-3088-6

    Article  Google Scholar 

  7. Saxonov S, Berg P, Brutlag D: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl Acad. Sci. USA 2006, 103(5):1412-1417. 10.1073/pnas.0510310103

    Article  Google Scholar 

  8. Larsen F, Gundersen G, Lopez R, Prydz H: CpG islands as gene markers in the human genome. Genomics (San Diego, CA) 1992, 13(4):1095-1107.

    Google Scholar 

  9. Wang Y, Leung F: An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics 2004, 20(7):1170. 10.1093/bioinformatics/bth059

    Article  Google Scholar 

  10. Bird A: DNA methylation patterns and epigenetic memory. Genes Dev 2002, 16: 6-21. 10.1101/gad.947102

    Article  Google Scholar 

  11. Herman J, Baylin S: Gene silencing in cancer in association with promoter hypermethylation. New Engl. J. Med 2003, 349(21):2042. 10.1056/NEJMra023075

    Article  Google Scholar 

  12. Issa J: CpG island methylator phenotype in cancer. Nat. Rev. Cancer 2004, 4(12):988-993. 10.1038/nrc1507

    Article  Google Scholar 

  13. Illingworth R, Kerr A, DeSousa D, Jorgensen H, Ellis P, Stalker J, Jackson D, Clee C, Plumb R, Rogers J: A novel CpG island set identifies tissue-specific methylation at developmental gene loci. PLoS Biol 2008, 6: e22. 10.1371/journal.pbio.0060022

    Article  Google Scholar 

  14. Heisler L, Torti D, Boutros P, Watson J, Chan C, Winegarden N, Takahashi M, Yau P, Huang T, Farnham P: CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genome. Nucleic Acids Res 2005, 33(9):2952. 10.1093/nar/gki582

    Article  Google Scholar 

  15. Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J. Mol. Biol 1987, 196(2):261. 10.1016/0022-2836(87)90689-9

    Article  Google Scholar 

  16. Rouchka E, Mazzarella R, States David J: Computational detection of CpG islands in DNA, Report: WUCS-97-39. 1997.

    Google Scholar 

  17. Rice P, Longden I, Bleasby A: EMBOSS: the European molecular biology open software suite. Trends Genetics 2000, 16(6):276-277. 10.1016/S0168-9525(00)02024-2

    Article  Google Scholar 

  18. Ponger L, Mouchiroud D: CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 2002, 18(4):631. 10.1093/bioinformatics/18.4.631

    Article  Google Scholar 

  19. Dasgupta N, Lin S, Carin L: Sequential modeling for identifying CpG island locations in human genome. IEEE Signal Process. Lett 2002, 9(12):407-409.

    Article  Google Scholar 

  20. Luque-Escamilla P, Martínez-Aroza J, Oliver J, Gómez-Lopera J, Román-Roldán R: Compositional searching of CpG islands in the human genome. Phys. Rev. E 2005, 71(6):61925.

    Article  Google Scholar 

  21. Bock C, Walter J, Paulsen M, Lengauer T: CpG island mapping by epigenome prediction. PLoS Comput. Biol 2007, 3(6):e110. 10.1371/journal.pcbi.0030110

    Article  Google Scholar 

  22. Sujuan Y, Asaithambi A, Liu Y: CpGIF: an algorithm for the identification of CpG islands. Bioinformation 2008, 2(8):335-338. 10.6026/97320630002335

    Article  Google Scholar 

  23. Hackenberg M, Previti C, Luque-Escamilla P, Carpena P, Martínez-Aroza J, Oliver J: CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinform 2006, 7: 446. 10.1186/1471-2105-7-446

    Article  Google Scholar 

  24. Takai D, Jones P: Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl Acad. Sci 2002, 99(6):3740-3745. 10.1073/pnas.052410099

    Article  Google Scholar 

  25. Yoon B, Vaidyanathan P: Identification of CpG islands using a bank of IIR lowpass filters. In Proceedings of 11 th Digital Signal Processing Workshop. Taos Ski Valley, New Mexico; Aug. 2004.

    Google Scholar 

  26. Rushdi A, Tuqan J: A new DSP-based measure for CpG islands detection. In Digital Signal Processing Workshop, 12th-Signal Processing Education Workshop, 4th. IEEE, Teton National Park, Wyoming; 2006.

    Google Scholar 

  27. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77(2):257-286. 10.1109/5.18626

    Article  Google Scholar 

  28. Won K, Prugel-Bennett A, Krogh A: Evolving the structure of hidden Markov models. IEEE Trans. Evol. Comput 2006, 10: 39-49.

    Article  Google Scholar 

  29. Anastassiou D: Genomic signal processing. IEEE Signal Process. Mag 2001, 18(4):8-20. 10.1109/79.939833

    Article  Google Scholar 

  30. Vaidyanathan P, Yoon B: The role of signal-processing concepts in genomics and proteomics. J. Franklin Inst 2004, 341(1–2):111-135.

    Article  MATH  Google Scholar 

  31. Ramachandran P, Antoniou A: Identification of hot-spot locations in proteins using digital filters. IEEE J. Sel. Topics Signal Process 2008, 2(3):378-389.

    Article  Google Scholar 

  32. Rao K, Swamy M: Analysis of genomics and proteomics using DSP techniques. IEEE Trans. Circuits Syst. 1: Regular Papers 2008, 55: 358.

    MathSciNet  Google Scholar 

  33. Song N, Yan H: Short exon detection in DNA sequences based on multifeature spectral analysis. EURASIP J. Adv. Signal Process 2011, 2011: 2. 10.1186/1687-6180-2011-2

    Article  Google Scholar 

  34. Liu B: Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press, Boca Raton,; 1998.

    Google Scholar 

  35. Agarwal R, Plotkin E, Swamy M: Statistically optimal null filter based on instantaneous matched processing. Circuits Syst. Signal Process 2001, 20: 37-61. 10.1007/BF01204921

    Article  MATH  MathSciNet  Google Scholar 

  36. Kakumani R, Devabhaktuni V, Ahmad M: Prediction of protein-coding regions in DNA sequences using a model-based approach. In IEEE International Symposium on Circuits and Systems. Seattle; 2008.

    Google Scholar 

  37. Yadav R, Agarwal R, Swamy M: A new improved model-based seizure detection using statistically optimal null filter. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE. Minneapolis, Minnesota; 2009.

    Google Scholar 

  38. Voss R: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett 1992, 68(25):3805-3808. 10.1103/PhysRevLett.68.3805

    Article  Google Scholar 

  39. National Centre for Biotechnology Information http://www.ncbi.nlm.nih.gov

  40. Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353-367. 10.1006/geno.1996.0298

    Article  Google Scholar 

Download references

Acknowledgements

This study was supported in parts by the Natural Sciences and Engineering Research Council (NSERC) of Canada and in part by the Regroupement Strategic en Microelectronique du Quebec (ReSMiQ).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rajasekhar Kakumani.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Kakumani, R., Ahmad, O. & Devabhaktuni, V. Identification of CpG islands in DNA sequences using statistically optimal null filters. J Bioinform Sys Biology 2012, 12 (2012). https://doi.org/10.1186/1687-4153-2012-12

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1687-4153-2012-12

Keywords