Identification of CpG islands in DNA sequences using statistically optimal null filters
 Rajasekhar Kakumani^{1},
 Omair Ahmad^{1} and
 Vijay Devabhaktuni^{2}
DOI: 10.1186/1687-4153-2012-12
© Kakumani et al.; licensee Springer. 2012
Received: 16 February 2012
Accepted: 23 July 2012
Published: 29 August 2012
Abstract
CpG dinucleotide clusters, also referred to as CpG islands (CGIs), are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation; as such, they are normally used as gene markers. Earlier CGI identification methods used the rich CpG dinucleotide content of CGIs as a characteristic measure to identify their locations. Some of the more recent methods exploit the fact that the probability of nucleotide G following nucleotide C is greater in a CGI than in a non-CGI. These methods use the difference in transition probabilities between successive nucleotides to distinguish a CGI from a non-CGI. These transition probabilities vary with the data being analyzed, and several sets of them have been reported in the literature, sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic, devoid of any such ambiguities, to reliably and efficiently identify CGIs in a given DNA sequence. Our proposed scheme combines maximum signal-to-noise ratio and least-squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22, and is shown to be highly reliable as well as efficient in identifying the CGIs.
Introduction
In recent years, computational methods for processing and interpreting the vast amount of genomic data generated by genome sequencing have gained a lot of scientific interest. Genomic sequences such as deoxyribonucleic acid (DNA) carry biological instructions that are crucial for the development and normal functioning of almost all living organisms[1]. A DNA molecule has a complex double-helix structure that involves two strands, each consisting of alternating sugar and phosphate groups. Attached to each sugar group of a DNA strand is one of four chemical bases, namely, adenine (A), thymine (T), guanine (G), and cytosine (C). A unit comprising a base, a sugar, and a phosphate is referred to as a nucleotide. Hydrogen bonds between the nucleotides A and T (and similarly between G and C) on opposite strands not only stabilize the DNA molecule, but also make the two strands complementary. Nucleotides in a DNA strand exhibit short, recurring patterns (also called sequence motifs) that are presumed to have a biological function. Identification of these patterns helps in understanding the biological information hidden in a DNA sequence. Human DNA consists of about 3 billion nucleotides, and the completion of genome sequencing of numerous model organisms has further enlarged the genomic databases. Completely deciphering the biological information in a DNA sequence is a daunting task, and the development of fast, efficient, and cost-effective computational techniques for this purpose is a big challenge.
Despite their accuracy, experimental methods employed by biologists for identification of CGIs are extremely time-consuming, simply because of the enormity of genomic data. Computational methods, on the other hand, can be much more attractive for the identification of possible CGIs. The results obtained from computational methods can be used by biologists to validate and further enhance the accuracy of identified CGI locations. Several computational methods[15–26] have been reported in the literature for identification of CGIs in DNA sequences. In one of the first computational attempts[15], a CGI is defined as a DNA segment fulfilling the following three conditions: (i) the length of the segment is at least 200 bp, (ii) the G and C content is ≥ 50%, and (iii) the observed CpG to expected CpG ratio (o/e) is ≥ 0.6. Observed CpG is the number of CpG dinucleotides in a segment, and expected CpG is calculated by multiplying the number of ‘C’s by the number of ‘G’s in a segment and then dividing the product by the length of the segment. This method, however, falsely identifies other G- and C-rich motifs, e.g., Alu repeats, as CGIs. In subsequent methods, these three conditions were made more stringent in order to reduce false identification, at the expense of missing some true CGIs[24]. More sophisticated methods utilizing two Markov chain models[27, 28], one for CGIs and the other for non-CGIs, have also been proposed[2, 25, 26]. The two Markov models differ in their model parameters, which characterize the transition probabilities between successive nucleotides in CGIs and non-CGIs, respectively. In these methods, a DNA segment is classified as a CGI if the log-score[2] computed using the Markov model for a CGI is greater than that computed using the Markov model for a non-CGI. Consequently, the model parameters used for CGIs and non-CGIs play a crucial role in identifying the CGIs. However, different methods employing such models sometimes produce inconsistent results.
Another criterion based on the physical distance distribution of CpG dinucleotides in a DNA segment has also been proposed[23]. Methods based on this criterion depend on the nucleotide composition of the DNA sequence being analyzed and suffer from low identification specificity.
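The three conditions of the early definition in[15] can be checked directly on a candidate segment. The following Python sketch is only an illustration of the description above; the function names `cpg_island_stats` and `meets_classic_criteria` are hypothetical and do not come from any of the cited tools.

```python
def cpg_island_stats(segment):
    """Return (length, G+C fraction, observed/expected CpG ratio) of a DNA segment."""
    seg = segment.upper()
    length = len(seg)
    if length == 0:
        return 0, 0.0, 0.0
    n_c = seg.count("C")
    n_g = seg.count("G")
    # Observed CpG: number of "CG" dinucleotides (they cannot overlap each other).
    observed = seg.count("CG")
    # Expected CpG: (#C * #G) / segment length, as described in the text.
    expected = (n_c * n_g) / length
    gc_fraction = (n_c + n_g) / length
    oe_ratio = observed / expected if expected > 0 else 0.0
    return length, gc_fraction, oe_ratio


def meets_classic_criteria(segment, min_len=200, min_gc=0.5, min_oe=0.6):
    """Check the three conditions: length, G+C content, and o/e ratio."""
    length, gc_fraction, oe_ratio = cpg_island_stats(segment)
    return length >= min_len and gc_fraction >= min_gc and oe_ratio >= min_oe
```

For instance, a 200 bp segment made of repeated "CG" satisfies all three conditions, whereas a 200 bp segment of repeated "AT" fails the G and C content condition.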
Recently, digital signal processing (DSP)-based algorithms have gained popularity for the analysis of genomic sequences, since such sequences can be mapped to numerical sequences. Digital filters have successfully been employed for identification of protein-coding regions (exons) in DNA sequences and hotspots in protein sequences[29–33]. Digital filters have also been used for identification of CGIs with considerable success[25, 26]. These methods are similar to Markov chain methods but use digital filters to compute a weighted log-score to identify CGIs. The method proposed in[25] employs a bank of IIR lowpass filters (about 40 filters, each with a different bandwidth) to identify the CGIs by looking at the weighted log-scores of all the filters together. The CGI identification sensitivity of this method is affected by the trade-off between the responsiveness of the filter and the stability of its output. Moreover, this method may become computationally demanding, as it makes use of a large number of filters in the bank. Another DSP-based algorithm, in[26], employs an underlying multinomial statistical model[34] to estimate its Markov chain parameters, followed by an FIR filter with a Blackman window to compute the weighted log-score.
It is evident from the above discussion that the CGI identification methods and, more importantly, the criteria used therein play a crucial role in identifying CGIs. As such, development of fast and efficient computational methods with highly reliable CGI identification criteria is a necessity. Statistically optimal null filters (SONF) have proven able to efficiently estimate short-duration signals embedded in noise[35]. In this article, we propose a new DSP algorithm for identification of CGIs using SONF, which combines maximum signal-to-noise ratio and least-squares optimization criteria to estimate the message signal, characterizing the CGI, embedded in noise. Normally, the CGI identification accuracy depends heavily on the Markov models used, and different models sometimes produce contrasting results. One of the main objectives of this article is therefore to find a uniform yet effective alternative CGI identification measure to replace the current measure based on transition probabilities. In the proposed scheme, we have formulated a simple basis function to be used in SONF which characterizes the CGI. Our criterion is devoid of any ambiguities associated with the choice of transition probabilities used in some of the algorithms. The proposed scheme is tested on a large number of already annotated DNA sequences obtained from human chromosomes 21 and 22. It is shown that our scheme is simple to implement and yet able to identify CGIs reliably and efficiently.
The rest of the article is organized as follows: the following section briefly describes a few existing DSP-based algorithms for the identification of CGIs. In Section “Proposed scheme”, the proposed SONF-based scheme for identifying CGIs in DNA sequences is explained. Results obtained from the proposed scheme are depicted as well as tabulated in Section “Results and discussion”. Finally, the “Conclusion” section concludes the article by describing some of the significant features of the proposed scheme.
Related study
In this section, we give a brief review of some of the existing CGI identification methods as a preparatory groundwork for the method to be proposed in Section “Proposed scheme”.
Markov chain approach
Table 1: Transition probabilities inside a CGI

| $p_{\beta\gamma}^{+}$ | A | C | G | T |
|---|---|---|---|---|
| A | 0.180 | 0.274 | 0.426 | 0.120 |
| C | 0.171 | 0.368 | 0.274 | 0.188 |
| G | 0.161 | 0.339 | 0.375 | 0.125 |
| T | 0.079 | 0.355 | 0.384 | 0.182 |

Table 2: Transition probabilities inside a non-CGI

| $p_{\beta\gamma}^{-}$ | A | C | G | T |
|---|---|---|---|---|
| A | 0.300 | 0.205 | 0.285 | 0.210 |
| C | 0.322 | 0.298 | 0.078 | 0.302 |
| G | 0.248 | 0.246 | 0.298 | 0.208 |
| T | 0.177 | 0.239 | 0.292 | 0.292 |
Here, $n_{\beta\gamma}^{\pm}$ is the number of dinucleotides βγ in a DNA sequence. Naturally, every row in the tables adds up to unity. As expected, in Table 1, which corresponds to the CGI Markov model, the probability that a C is followed by a G is much higher than that in Table 2.
If S(n) > 0, the given DNA sequence is more likely to belong to a CGI, and if S(n) < 0, the sequence probably belongs to a non-CGI region.
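The decision rule above can be illustrated with a small Python sketch. The expression for S(n) is not reproduced in this section, so the code assumes the usual log-odds form of[2], i.e., the sum over successive nucleotide pairs of the logarithm of the ratio of the CGI and non-CGI transition probabilities. The function name `log_odds_score` and the choice to skip symbols other than A, C, G, and T are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ACGT"

# Transition probabilities inside a CGI (rows: current base, columns: next base A, C, G, T).
P_PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}

# Transition probabilities inside a non-CGI, in the same layout.
P_MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}


def log_odds_score(window):
    """Sum of log(p+ / p-) over successive nucleotide pairs (assumed form of S)."""
    seq = window.upper()
    total = 0.0
    for prev, curr in zip(seq, seq[1:]):
        if prev not in P_PLUS or curr not in NUCLEOTIDES:
            continue  # assumption: ignore ambiguous symbols such as N
        j = NUCLEOTIDES.index(curr)
        total += math.log(P_PLUS[prev][j] / P_MINUS[prev][j])
    return total
```

With these tables, a window such as "CGCGCG" yields a positive score, while "TATATA" yields a negative one.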
IIR lowpass filter approach
The values of S_{ k }(n) obtained for all k and n are then used to obtain a two-level contour plot. The bands corresponding to S_{ k }(n) > 0 determine the locations of CGIs.
In this method, the use of a filter bank increases the computational overhead considerably. For a fair comparison with the other methods, instead of a bank of M filters, we have used a single one-pole filter with the optimized parameter α = 0.99, which reduces the number of computations considerably.
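A one-pole lowpass filter of this kind can be sketched in a few lines of Python. The exact transfer function is not reproduced in this section; the sketch assumes the unity-DC-gain form S(n) = αS(n−1) + (1−α)y(n), where y(n) is the per-position log-ratio obtained from the transition tables. The function name `one_pole_lowpass` is hypothetical.

```python
def one_pole_lowpass(y, alpha=0.99):
    """Smooth the sequence y with S(n) = alpha * S(n-1) + (1 - alpha) * y(n).

    Assumed filter form with unity gain at DC; the initial state S(-1) is zero.
    """
    smoothed = []
    state = 0.0
    for value in y:
        state = alpha * state + (1.0 - alpha) * value
        smoothed.append(state)
    return smoothed
```

A value of α close to 1 gives a long memory and a stable output, at the cost of a slower response to the start and end of a CGI, which is the trade-off mentioned for this approach.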
Multinomial statistical model
This method uses an FIR digital filter with variable coefficients generated by a Blackman window to calculate the log-likelihood ratio S(n) given in (4). The locations with S(n) greater than zero are the probable locations of CGIs.
All of the above-mentioned methods rely on transition probability tables to calculate the log-likelihood ratio used to identify CGIs. The methods in[25, 26] differ specifically in the way y(n), obtained from the respective transition tables, is averaged. It is shown later in Section “Results and discussion” that the choice of the transition tables may produce contrasting results. Hence, a more reliable and efficient scheme that does not depend on these transition tables is necessary for identifying CGIs.
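The windowed averaging can be sketched as follows. This is a minimal illustration and not the implementation of[26]: it uses the standard Blackman window w(k) = 0.42 − 0.5cos(2πk/(M−1)) + 0.08cos(4πk/(M−1)), and the normalization of the output by the sum of the window coefficients is an assumption. The function names are hypothetical.

```python
import math


def blackman_window(length):
    """Standard Blackman window coefficients for the given length (>= 3)."""
    if length < 3:
        raise ValueError("window length must be at least 3")
    m = length - 1
    return [
        0.42 - 0.5 * math.cos(2 * math.pi * k / m) + 0.08 * math.cos(4 * math.pi * k / m)
        for k in range(length)
    ]


def blackman_filter(y, length=100):
    """Causal FIR filtering of y with Blackman weights, normalized to unit sum."""
    w = blackman_window(length)
    norm = sum(w)
    out = []
    for n in range(len(y)):
        acc = 0.0
        for k, wk in enumerate(w):
            if n - k >= 0:  # samples before the start of y are treated as zero
                acc += wk * y[n - k]
        out.append(acc / norm)
    return out
```

Because the Blackman coefficients are largest at the centre of the window and close to zero at its ends, samples near the window edges contribute little to the filtered value.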
Proposed scheme
where S_{ n }= {s(m)} is a message signal corresponding to a CGI and R_{ n }= {r(m)} is a residual signal. S_{ n }and R_{ n } are each of length L. Let Φ = {ϕ(m)} be a fixed binary basis sequence of length L having some characteristic property of CGI.
Now, the message signal corresponding to a CGI can be expressed as S_{ n }= V_{ n }Φ, where V_{ n }= {v(m)} and Φ are sequences each of length L, and the sequence V_{ n }Φ is obtained by multiplying the corresponding elements of V_{ n } and Φ. The sequence V_{ n } is determined by minimizing R_{ n } in the least-squares sense. The objective of the proposed method is to choose the basis sequence such that the V_{ n } resulting from the optimization process has some discriminating feature indicating whether the associated sequence X_{ n } belongs to a CGI. The following subsections explain in detail the steps involved in identification of CGIs in a DNA sequence using SONF.
Numerical mapping of DNA sequences
As DNA sequences are alphabetical in nature, they need to be mapped to numerical sequences in order to employ DSP techniques for DNA sequence analysis. Several mapping techniques have been reported in the literature. One of the earliest and most popular mappings is that of Voss’s binary indicator sequences[38]. A DNA sequence X can be mapped to a set of four digital signals by forming four binary indicator sequences, namely, X_{ A }, X_{ T }, X_{ G }, and X_{ C }. In each of these binary indicator sequences, ’1’ represents the presence and ’0’ the absence of the corresponding base A, T, G, or C in X. For instance, considering a DNA sequence X = {ATCCGAAGTATAACGAA}, the binary indicator sequence corresponding to G, i.e., X_{ G }, can be expressed as X_{ G }= {00001001000000100}. Indicator sequences for the remaining three nucleotides can be represented in a similar fashion.
The problem of CGI identification deals with G and C content in a DNA sequence. Hence, we define a new indicator sequence X_{ CG }= {x_{ CG }(n)}, which indicates the presence of the nucleotides C and G in the DNA sequence. For example, the binary indicator sequence X_{ CG } of the DNA sequence above is X_{ CG }= {00111001000001100}.
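The two mappings above can be reproduced with a short Python sketch; the helper name `indicator` is hypothetical and is used here only to illustrate the definitions.

```python
def indicator(sequence, bases):
    """Binary indicator sequence: 1 where the nucleotide is one of `bases`, else 0."""
    return [1 if nucleotide in bases else 0 for nucleotide in sequence.upper()]


x = "ATCCGAAGTATAACGAA"
x_g = indicator(x, "G")    # Voss indicator sequence for G
x_cg = indicator(x, "CG")  # combined indicator sequence for C and G

print("".join(map(str, x_g)))   # 00001001000000100
print("".join(map(str, x_cg)))  # 00111001000001100
```

The printed strings match the sequences X_{ G } and X_{ CG } given in the text for the example sequence.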
Choosing the basis sequence
Based on the above observations, the basis sequence which characterizes a CGI can be formulated as Φ = {1100110011…001100}. The 1’s in Φ represent either the nucleotide C or G. The 1’s always appear in pairs, where each pair represents one of the dinucleotides CC, CG, GC, or GG. The 0’s in Φ form the gaps between the dinucleotides. A gap size of 2 is chosen between the dinucleotides. This choice of Φ also satisfies the basic criterion of a CGI, i.e., at least 50% of the nucleotide content in a CGI is due to C and G.
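A basis sequence with this structure can be generated as in the following Python sketch. The function name `basis_sequence` and its parameters are hypothetical; the default length of 200 is taken from the window length used in the algorithm summary of this article.

```python
def basis_sequence(length=200, pair=2, gap=2):
    """Binary basis sequence: `pair` ones followed by `gap` zeros, repeated to `length`."""
    period = [1] * pair + [0] * gap
    return [period[i % len(period)] for i in range(length)]


phi = basis_sequence()
print("".join(map(str, phi[:12])))  # 110011001100
print(sum(phi) / len(phi))          # 0.5
```

With a pair size of 2 and a gap size of 2, exactly half of the entries of Φ are 1’s, which corresponds to the 50% G and C content mentioned above.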
IMF
The v(m)∈V_{ n }in (15) is an unknown gain.
Least square optimization of the IMF output
The objective of the second stage in SONF is to determine a sequence Λ = {λ(m)}, which, when used to scale the IMF output I_{ n }, produces the SONF output, Y_{ n }, such that $Y_{n}\to V_{n}\Phi$. Here, V_{ n }Φ is the element-wise product of V_{ n } and Φ, so that Y_{ n } is an estimate of S_{ n }, the message signal corresponding to a CGI.
where y(m) is an element of the SONF output, Y_{ n }. As we desire optimal null filtering, i.e., y(m) = s(m), the residual element, r_{0}(m), needs to be entirely eliminated.
where SNR is the input signal-to-noise ratio (considering r(m) to be noise).
approaches zero (as the value of c(m) progressively increases with m). So, the value of initial SNR in (20) will influence only the starting few samples in Y_{ n }.
In this case of DNA analysis, one may choose the initial value of the gain P(0) to be 1 and ι(0) = ι(1).
The proposed SONF-based CGI identification algorithm for a DNA sequence of length N can now be summarized as follows:
Initialization: Set the base location index n = 0.

Step 1: Apply a rectangular window of length L = 200 starting at the base location n of the DNA sequence X to obtain the windowed sequence X_{ n }.

Step 2: Obtain the binary indicator sequence X_{ CG }for the windowed sequence, X_{ n }, from Step 1.

Step 3: X_{ CG }from Step 2, along with the binary basis sequence Φ, form the inputs to SONF. The corresponding SONF output sequence, Y_{ n }, is evaluated using the recursive relations given in (23), by assuming P(0) = 1 and ι(0) = ι(1).

Step 4: Compute the SNR power gain G(X_{ n }), which is the ratio of the variance of the SONF output, Y_{ n }, to the variance of the corresponding input X_{ n }.

Step 5: Increment the value of n by 1, i.e., n = n + 1. If n ≤ (N−L) go to Step 1, else go to Step 6.

Step 6: Plot G(X_{ n }) as a function of n + L and get its upper envelope. The peaks in the resulting plot which are above the threshold, η, indicate the locations of CGIs identified in X.

Step 7: Exit the algorithm.
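The windowing loop of the algorithm (Steps 1, 2, 4, and 5) can be sketched in Python as shown below. The SONF recursion itself is not reproduced here, so the filter is passed in as a caller-supplied function `filter_fn`; this argument, the function names, the use of the variance of X_{ CG } as the input variance, and the convention of returning a gain of zero when that variance is zero are all assumptions made for illustration. The envelope extraction and thresholding of Step 6 are omitted.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def sliding_gain(sequence, phi, filter_fn):
    """Gain G(X_n) for every window start n = 0, ..., N - L.

    filter_fn(x_cg, phi) must return the filter output Y_n for one window;
    it stands in for the SONF stage, which is not implemented in this sketch.
    """
    window_len = len(phi)
    gains = []
    for n in range(len(sequence) - window_len + 1):
        window = sequence[n:n + window_len].upper()        # Step 1: rectangular window
        x_cg = [1 if b in "CG" else 0 for b in window]     # Step 2: indicator sequence
        y = filter_fn(x_cg, phi)                           # Step 3: supplied by the caller
        var_in = variance(x_cg)
        gains.append(variance(y) / var_in if var_in > 0 else 0.0)  # Step 4
    return gains


# Placeholder filter used only to exercise the loop; it is NOT the SONF.
def mask_with_basis(x_cg, phi):
    return [a * b for a, b in zip(x_cg, phi)]


gains = sliding_gain("CCAACCAAAAAA", [1, 1, 0, 0], mask_with_basis)
```

Keeping the filter as an argument allows the same loop to be reused once an implementation of the recursive relations in (23) is available.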
Prediction measures
The value of CC ranges from −1 to 1, where a value of 1 corresponds to a perfect prediction; a value of −1 indicates that every CGI has been predicted as non-CGI, and vice versa.
In this article, we have evaluated the performance of different CGI identification methods at the nucleotide level. For example, the value of TP is obtained by adding all the nucleotides predicted to be true positives, and the other outcomes are calculated in a similar manner. At the CGI level, by contrast, even if one nucleotide (or a threshold minimum number of nucleotides) corresponding to a CGI is predicted to be a true positive, the entire CGI is assumed to be predicted correctly.
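The nucleotide-level measures can be computed from the four outcome counts as in the following Python sketch. The defining equations are not reproduced in this section, so the formulas below are assumptions based on the conventions commonly used in gene prediction[40]: Sn = TP/(TP+FN), Sp = TP/(TP+FP), Acc = (TP+TN)/(TP+TN+FP+FN), and CC as the correlation coefficient of the two-class confusion matrix. The function name `prediction_measures` is hypothetical.

```python
import math


def prediction_measures(tp, tn, fp, fn):
    """Return (Sn, Sp, CC, Acc) from nucleotide-level outcome counts.

    Assumed definitions: Sn = TP/(TP+FN), Sp = TP/(TP+FP),
    Acc = (TP+TN)/total, CC = (TP*TN - FP*FN) / sqrt of the product
    of the four marginal sums. Undefined ratios are returned as 0.0.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc, acc
```

Under these definitions, CC equals 1 when there are no false positives or false negatives and −1 when every nucleotide is misclassified, in agreement with the range stated above.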
Results and discussion
The proposed CGI prediction scheme is tested on several genomic sequences of varying lengths taken from human chromosomes 21 and 22. More precisely, we have used the three contigs NT_113952.1, NT_113954.1, and NT_113958.2 from chromosome 21, and the contig NT_028395.3 from chromosome 22 for our analysis. All the sequence data considered for this study are obtained from the GenBank database[39]. The performance of the proposed scheme is compared with that of other popular DSP-based approaches such as the Markov chain[2], IIR lowpass filter[25], and multinomial model[26] methods.
First, a DNA sequence from human chromosome X with the GenBank accession number L44140 is analyzed for illustrative purposes. The sequence is of length 219447 bp and is already annotated, i.e., the locations of its CGIs are already known and can be obtained from[39]. The sequence L44140 is also used to obtain the values of the threshold, η, used by the DSP-based methods being compared in this article.
Figure 8b shows the performance of the IIR lowpass filter approach, where the log-likelihood ratio, S(n), is plotted against the base index of the sequence, n. The transition probability tables given in[25] are used to calculate S(n). For a fair comparison, instead of a bank of M filters, we have used a single one-pole filter with the optimized parameter α = 0.99 for this method. All the base locations, n, with S(n) > 0 are very likely to be part of a CGI. A window length of 200 bp is considered for the method. Similar to the Markov chain method, this method also produces many false positives, which affects the prediction accuracy.
Figure 8c shows the prediction of CGIs using the multinomial model in[26]. An underlying multinomial statistical model is employed to estimate the Markov chain model parameters that result in the transition probability tables given in[26]. A Blackman window of length 100 bp is employed for calculating the filtered log-likelihood ratio. The Blackman window gives larger weights to the central samples of the window, thus reducing the edge effects. Windows with a positive filtered log-likelihood ratio are considered to be part of a CGI. This method shows a considerably high number of false positives, making the CGI prediction unreliable.
Figure 8d shows the performance of the proposed SONF scheme in predicting the CGIs. Unlike the above-mentioned methods, our scheme utilizes the binary basis sequence, Φ, instead of the transition probability tables. The proposed scheme first maximizes the SNR of the output at each time instant using the IMF, and then further enhances the estimated signal using the least-squares optimization criterion, to estimate the presence of Φ in the input windowed DNA sequence. A window size of 200 is used for the proposed method. The effectiveness of the proposed scheme is clearly visible in Figure 8d, which depicts more contrasting peaks as compared to the other three approaches. These contrasting peaks make the identification process comparatively easier, resulting in fewer false positives.
We have evaluated the computation time of the proposed method using the tic and toc functions in MATLAB. The necessary precautions were taken: all applications except MATLAB were closed, a fresh session of MATLAB was started for each task, and MATLAB was warmed up with the code, i.e., the first run of the code was ignored. For processing a sequence of fixed length, the CPU time of the Markov chain method was found to be the least, followed by the SONF, IIR, and multinomial approaches with an additional CPU time of 1.29%, 1.78%, and 1.82%, respectively. This difference is not substantial considering today’s computing resources.
Comparison of different methods for identification of CGIs
| Contig | Performance measure | Markov chain | IIR filter | Multinomial model | CpGCluster | SONF |
|---|---|---|---|---|---|---|
| NT_113952.1 (length = 184355) | Sn | 0.8466 | 0.8656 | 0.4524 | 0.5046 | 0.8677 |
| | Sp | 0.8728 | 0.8320 | 0.2833 | 0.9995 | 0.4457 |
| | CC | 0.8621 | 0.8180 | 0.3609 | 0.6941 | 0.6192 |
| | Acc | 0.9955 | 0.9848 | 0.4948 | 0.9778 | 0.9878 |
| NT_113954.1 (length = 129889) | Sn | 0.3285 | 0.2226 | 0.0055 | 0.2986 | 0.5420 |
| | Sp | 0.3082 | 0.2585 | 0.0021 | 0.9946 | 0.2094 |
| | CC | 0.3152 | 0.2369 | 0.0040 | 0.4381 | 0.4382 |
| | Acc | 0.9940 | 0.9940 | 0.4989 | 0.9690 | 0.9894 |
| NT_113958.2 (length = 209483) | Sn | 0.4555 | 0.3561 | 0.2938 | 0.2716 | 0.8852 |
| | Sp | 0.4652 | 0.4439 | 0.0202 | 0.9994 | 0.2880 |
| | CC | 0.4527 | 0.3899 | 0.0119 | 0.4996 | 0.4954 |
| | Acc | 0.9849 | 0.9845 | 0.4960 | 0.9532 | 0.9705 |
| NT_028395.3 (length = 647850) | Sn | 0.5440 | 0.4200 | 0.0000 | 0.4489 | 0.8789 |
| | Sp | 0.8233 | 0.7590 | 0.0000 | 0.9947 | 0.4534 |
| | CC | 0.6667 | 0.5616 | 0.0116 | 0.9753 | 0.6267 |
| | Acc | 0.9945 | 0.9932 | 0.8710 | 0.9532 | 0.9887 |
Conclusion
In this article, a new DSP-based technique using SONFs is proposed for the prediction of CGIs in DNA sequences. A novel CpG identification characteristic is presented in the form of a binary basis sequence, which is shown to identify CGIs reliably. It has also been shown that the performance of the existing methods which use discriminating transition probability tables for CGIs/non-CGIs is not consistent. The prediction accuracy of these methods is highly dependent on the training data used to obtain the transition probabilities of CGIs and non-CGIs. The inability to find a unique CGI identification characteristic has resulted in failure to predict many of the CGIs. This article makes an attempt to present a unique CGI identification characteristic which does not require any training. Furthermore, the ability of SONF to track short-duration signals is exploited in identifying the CGIs in DNA sequences. SONF combines maximum signal-to-noise ratio and least-squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The performance of the proposed technique is tested on four randomly chosen contigs from human chromosomes 21 and 22. The simulation results comparing the performance of the proposed technique with that of the other three DSP-based CGI prediction techniques have shown that the proposed approach enjoys superior prediction accuracy in terms of sensitivity. The overall prediction accuracy of the proposed approach is also consistently above 97% and is comparable to that of the Markov chain method, making it a reliable method.
Declarations
Acknowledgements
This study was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and in part by the Regroupement Stratégique en Microélectronique du Québec (ReSMiQ).
References
1. Lodish H, Berk A, Zipursky S, Matsudaira P, Baltimore D, Darnell J: Molecular Cell Biology. Scientific American, New York; 1995.
2. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis. Cambridge University Press, Cambridge; 1998.
3. Antequera F, Bird A: Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA 1993, 90(24):11995–11999. 10.1073/pnas.90.24.11995
4. Antequera F, Bird A: CpG islands as genomic footprints of promoters that are associated with replication origins. Curr. Biol 1999, 9: 661–667. 10.1016/S0960-9822(99)80290-5
5. Ioshikhes I, Zhang M: Large-scale human promoter mapping using CpG islands. Nat. Genet 2000, 26: 61–63. 10.1038/79189
6. Antequera F: Structure, function, evolution of CpG island promoters. Cell. Mol. Life Sci 2003, 60(8):1647–1658. 10.1007/s00018-003-3088-6
7. Saxonov S, Berg P, Brutlag D: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl Acad. Sci. USA 2006, 103(5):1412–1417. 10.1073/pnas.0510310103
8. Larsen F, Gundersen G, Lopez R, Prydz H: CpG islands as gene markers in the human genome. Genomics (San Diego, CA) 1992, 13(4):1095–1107.
9. Wang Y, Leung F: An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics 2004, 20(7):1170. 10.1093/bioinformatics/bth059
10. Bird A: DNA methylation patterns and epigenetic memory. Genes Dev 2002, 16: 6–21. 10.1101/gad.947102
11. Herman J, Baylin S: Gene silencing in cancer in association with promoter hypermethylation. New Engl. J. Med 2003, 349(21):2042. 10.1056/NEJMra023075
12. Issa J: CpG island methylator phenotype in cancer. Nat. Rev. Cancer 2004, 4(12):988–993. 10.1038/nrc1507
13. Illingworth R, Kerr A, DeSousa D, Jorgensen H, Ellis P, Stalker J, Jackson D, Clee C, Plumb R, Rogers J: A novel CpG island set identifies tissue-specific methylation at developmental gene loci. PLoS Biol 2008, 6: e22. 10.1371/journal.pbio.0060022
14. Heisler L, Torti D, Boutros P, Watson J, Chan C, Winegarden N, Takahashi M, Yau P, Huang T, Farnham P: CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genome. Nucleic Acids Res 2005, 33(9):2952. 10.1093/nar/gki582
15. Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J. Mol. Biol 1987, 196(2):261. 10.1016/0022-2836(87)90689-9
16. Rouchka E, Mazzarella R, States David J: Computational detection of CpG islands in DNA, Report: WUCS-97-39. 1997.
17. Rice P, Longden I, Bleasby A: EMBOSS: the European molecular biology open software suite. Trends Genetics 2000, 16(6):276–277. 10.1016/S0168-9525(00)02024-2
18. Ponger L, Mouchiroud D: CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 2002, 18(4):631. 10.1093/bioinformatics/18.4.631
19. Dasgupta N, Lin S, Carin L: Sequential modeling for identifying CpG island locations in human genome. IEEE Signal Process. Lett 2002, 9(12):407–409.
20. Luque-Escamilla P, Martínez-Aroza J, Oliver J, Gómez-Lopera J, Román-Roldán R: Compositional searching of CpG islands in the human genome. Phys. Rev. E 2005, 71(6):61925.
21. Bock C, Walter J, Paulsen M, Lengauer T: CpG island mapping by epigenome prediction. PLoS Comput. Biol 2007, 3(6):e110. 10.1371/journal.pcbi.0030110
22. Sujuan Y, Asaithambi A, Liu Y: CpGIF: an algorithm for the identification of CpG islands. Bioinformation 2008, 2(8):335–338. 10.6026/97320630002335
23. Hackenberg M, Previti C, Luque-Escamilla P, Carpena P, Martínez-Aroza J, Oliver J: CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinform 2006, 7: 446. 10.1186/1471-2105-7-446
24. Takai D, Jones P: Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl Acad. Sci 2002, 99(6):3740–3745. 10.1073/pnas.052410099
25. Yoon B, Vaidyanathan P: Identification of CpG islands using a bank of IIR lowpass filters. In Proceedings of 11th Digital Signal Processing Workshop. Taos Ski Valley, New Mexico; Aug. 2004.
26. Rushdi A, Tuqan J: A new DSP-based measure for CpG islands detection. In Digital Signal Processing Workshop, 12th - Signal Processing Education Workshop, 4th. IEEE, Teton National Park, Wyoming; 2006.
27. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77(2):257–286. 10.1109/5.18626
28. Won K, Prugel-Bennett A, Krogh A: Evolving the structure of hidden Markov models. IEEE Trans. Evol. Comput 2006, 10: 39–49.
29. Anastassiou D: Genomic signal processing. IEEE Signal Process. Mag 2001, 18(4):8–20. 10.1109/79.939833
30. Vaidyanathan P, Yoon B: The role of signal-processing concepts in genomics and proteomics. J. Franklin Inst 2004, 341(1–2):111–135.
31. Ramachandran P, Antoniou A: Identification of hotspot locations in proteins using digital filters. IEEE J. Sel. Topics Signal Process 2008, 2(3):378–389.
32. Rao K, Swamy M: Analysis of genomics and proteomics using DSP techniques. IEEE Trans. Circuits Syst. I: Regular Papers 2008, 55: 358.
33. Song N, Yan H: Short exon detection in DNA sequences based on multi-feature spectral analysis. EURASIP J. Adv. Signal Process 2011, 2011: 2. 10.1186/1687-6180-2011-2
34. Liu B: Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press, Boca Raton; 1998.
35. Agarwal R, Plotkin E, Swamy M: Statistically optimal null filter based on instantaneous matched processing. Circuits Syst. Signal Process 2001, 20: 37–61. 10.1007/BF01204921
36. Kakumani R, Devabhaktuni V, Ahmad M: Prediction of protein-coding regions in DNA sequences using a model-based approach. In IEEE International Symposium on Circuits and Systems. Seattle; 2008.
37. Yadav R, Agarwal R, Swamy M: A new improved model-based seizure detection using statistically optimal null filter. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE. Minneapolis, Minnesota; 2009.
38. Voss R: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett 1992, 68(25):3805–3808. 10.1103/PhysRevLett.68.3805
39. National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov
40. Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353–367. 10.1006/geno.1996.0298
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.