Detecting Periodic Genes from Irregularly Sampled Gene Expressions: A Comparison Study
© Wentao Zhao et al. 2008
Received: 29 August 2007
Accepted: 20 May 2008
Published: 4 June 2008
Time series microarray measurements of gene expressions have been exploited to discover genes involved in cell cycles. Due to experimental constraints, most microarray observations are obtained through irregular sampling. In this paper three popular spectral analysis schemes, namely, Lomb-Scargle, Capon and missing-data amplitude and phase estimation (MAPES), are compared in terms of their ability and efficiency to recover periodically expressed genes. Based on in silico experiments for microarray measurements of Saccharomyces cerevisiae, Lomb-Scargle is found to be the most efficacious scheme. 149 genes are then identified to be periodically expressed in the Drosophila melanogaster data set.
The functioning of eukaryotic cells is controlled by accurate timing of biological cycles, such as cell cycles and circadian rhythms. These are composed of an echelon of molecular events and checkpoints. At the transcription level, these events can be quantitatively observed by measuring the concentration of messenger RNA (mRNA), which is transcribed from DNA and serves as the template for synthesizing protein. To achieve this goal, microarray experiments exploit high-throughput gene chips to measure genome-wide gene expression sequentially at discrete time points. These time series data have three characteristics. Firstly, most data sets are of small sample size, usually not more than 50 data points. Large sample sizes are not financially affordable due to the high cost of gene chips. Moreover, the cell cultures lose their synchronization after a period of time and then yield meaningless data. Secondly, the data are usually unevenly sampled and have many time points missing. Thirdly, most data sets are corrupted by experimental noise, and the resulting uncertainty should be addressed in a stochastic framework.
Extensive genome-wide time course microarray experiments have been conducted on organisms such as Saccharomyces cerevisiae (budding yeast), human HeLa cells, and Drosophila melanogaster (fruit fly). The budding yeast data set has served as the predominant data source for various statistical methods in search of periodically expressed genes, mainly due to its pioneering publication and its relatively larger sample size compared with its peers. By assuming the signal in the cell cycle to be a simple sinusoid, Spellman et al. and Whitfield et al. performed a Fourier transformation on the data sampled with different synchronization methods, while Giurcaneanu explored the stochastic complexity of the detection mechanism of periodically expressed genes by means of generalized Gaussian distributions. Ahdesmäki et al. implemented a robust periodicity testing procedure also based on the non-Gaussian noise assumption. Alternatively, Luan and Li employed guide genes and constructed cubic B-spline-based periodic functions for modeling, while Lu et al. employed up to three harmonics to fit the data and proposed a periodic normal mixture model. Power spectral density estimation schemes have also been employed. Wichert et al. applied the traditional periodogram to various data sets. Bowles et al. compared the Capon and robust Capon methods in terms of their ability to identify a predetermined frequency using evenly sampled data sets, under the assumption of a known period. De Lichtenberg et al. compared several of these computational methods while proposing a new score that combines periodicity and regulation magnitude. The majority of these works dealt with evenly sampled data. When missing data points were present, either the vacancies were filled by interpolation in the time domain, or the genes were discarded if more than 30% of the data samples were missing.
Biological experiments generally output unequally spaced measurements. The major reasons are experimental constraints and event-driven observation, in which the rate of measurement is directly proportional to the occurrence of events. Therefore, an analysis based on unevenly sampled data is practically desirable, although technically more challenging. While providing modern spectral estimation methods for stationary processes with complete and evenly sampled data, the signal processing literature has witnessed an increased interest in analyzing unevenly sampled data sets, especially in astronomy, in recent decades. The harmonics exploited in the discrete Fourier transform (DFT) are no longer orthogonal under uneven sampling. However, Lomb and Scargle demonstrated that a phase shift suffices to make the sine and cosine terms orthogonal. The Lomb-Scargle scheme has been exploited in analyzing the budding yeast data set by Glynn et al. Schwarzenberg-Czerny employed one-way analysis of variance (AoV) and formulated an AoV periodogram as a method to detect sharp periodicities. However, it relies on an assumption that is infeasible in biological experiments, namely, that the observation duration covers many cycles. Along this line of research, Ahdesmäki et al. proposed to use robust regression techniques, while Stoica and Sandgren updated the traditional Capon method to cope with irregularly sampled data. Wang et al. reported a novel technique, referred to as the missing-data amplitude and phase estimation (MAPES) approach, which estimates the missing data and spectra iteratively through the expectation maximization (EM) algorithm. In general, the Capon and MAPES methods possess a better spectral resolution than the Lomb-Scargle periodogram. In this paper, we analyze the performance of three of the most representative spectral estimation methods, namely, the Lomb-Scargle periodogram, the Capon method, and the MAPES technique, in the presence of missing samples and irregularly spaced samples.
The following questions are to be answered in this study: do technically more sophisticated schemes, such as MAPES, achieve a better performance on real biological data sets than simpler schemes? Is the efficiency sacrificed in using these advanced methods justifiable?
The remainder of this paper is structured as follows. In Section 2, we introduce the three spectral analysis methods, that is, Lomb-Scargle, Capon, and MAPES. Hypothesis tests for periodicity detection and the corresponding p-values are also formulated, and the multiple testing correction is discussed. Section 3 presents simulation results. The performances of the three schemes are compared based on published cell-cycle and non-cell-cycle genes of Saccharomyces cerevisiae (budding yeast). Then the spectral analysis for the data set of Drosophila melanogaster (fruit fly) is performed, and a list of 149 genes is presented as cycle-related genes. The synchronization effects are also considered. Concluding remarks and future works constitute the last section, and full results are provided online in the supplementary materials.
In this section, the Lomb-Scargle periodogram, Capon method, and MAPES approach are introduced and compared in terms of their features and implementation complexity. The detailed derivations are omitted. As a general notational convention, matrices and vectors are represented in bold characters, while scalars are denoted in regular fonts.
2.1. Lomb-Scargle Periodogram
The deployment of the Fourier transform and the traditional periodogram relies on evenly sampled data, which are projected on orthogonal sine and cosine harmonics. Uneven sampling ruins this orthogonality. Hence, Parseval's theorem fails, and there exists a power discrepancy between the time and frequency domains. When analyzing astronomical data, which in general are collected at uncontrollable observation times, Lomb found that a phase shift of the sine and cosine functions would restore the orthogonality among harmonics. Scargle complemented Lomb's periodogram by deriving its distribution. Since then, the established Lomb-Scargle periodogram has been exploited in numerous fields and applications, including bioinformatics and genomics (see, e.g., Glynn et al.).
Notice further that the spectra on the front and rear halves of the frequency grid are symmetric since the microarray experiments output real values.
Lomb-Scargle periodogram represents an efficient solution in estimating the spectra of unevenly sampled data sets. Our simulation results also verify its superior performance for biological data with small sample size and various unevenly sampled patterns.
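As an illustration of the phase-shift idea described in this subsection, the following is a minimal sketch of the classical Lomb-Scargle periodogram in Scargle's variance-normalized form. The function name `lomb_scargle`, its arguments, and the choice of normalization are assumptions made for this example; the normalization and frequency grid used in this study may differ.

```python
import numpy as np

def lomb_scargle(t, y, omegas):
    """Classical Lomb-Scargle periodogram (illustrative sketch).

    t      : sampling times, not necessarily evenly spaced
    y      : measurements taken at times t
    omegas : strictly positive angular frequencies to probe
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()          # remove the sample mean
    var = y.var(ddof=1)       # sample variance used for normalization
    power = np.empty(len(omegas))
    for k, w in enumerate(omegas):
        # Phase shift tau that makes the shifted sine and cosine
        # terms orthogonal on the given (uneven) sampling times.
        tau = np.arctan2(np.sin(2 * w * t).sum(),
                         np.cos(2 * w * t).sum()) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[k] = ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s)) / (2 * var)
    return power
```

The sketch evaluates each probing frequency independently, which costs on the order of (number of samples) times (number of frequencies) operations; for the small sample sizes typical of microarray time series this is negligible.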
2.2. Capon Method
Note that we have not included in this spectrum estimate a scaling factor. However, the absence of this scaling factor does not affect periodicity analysis for the genes. Therefore, we neglect this scaling factor. The bandwidth parameter cannot exceed to guarantee an existing . The larger the , the better the resolution of the obtained spectra.
The Capon method is slightly more computationally complex than Lomb-Scargle periodogram, and it usually achieves a better performance in terms of resolution provided that there are sufficient samples. However, for highly corrupted biological data with small sample size, this is not true.
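For orientation, the sketch below implements the classical Capon spectral estimator for complete, evenly sampled data. It is not the irregular-sampling variant of Stoica and Sandgren used in this study, and, as in the text, the scaling factor is omitted. The names `capon_spectrum` and `m` (the filter length, playing the role of the bandwidth parameter) are assumptions made for this example.

```python
import numpy as np

def capon_spectrum(y, m, omegas):
    """Classical Capon spectrum for evenly sampled data (illustrative sketch).

    y      : evenly sampled real-valued series
    m      : filter length; must be small enough that the sample
             covariance matrix is invertible (at most about len(y) / 2)
    omegas : angular frequencies in radians per sample
    """
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    n = len(y)
    # Forward snapshot vectors [y(i), ..., y(i + m - 1)], one per row.
    snaps = np.array([y[i:i + m] for i in range(n - m + 1)])
    # Sample covariance matrix of the snapshots.
    R = snaps.T @ snaps / snaps.shape[0]
    R_inv = np.linalg.inv(R)
    k = np.arange(m)
    power = np.empty(len(omegas))
    for j, w in enumerate(omegas):
        a = np.exp(1j * w * k)  # steering vector at frequency w
        power[j] = 1.0 / np.real(np.conj(a) @ R_inv @ a)
    return power
```

A larger `m` sharpens the spectral peaks but leaves fewer snapshots to estimate the covariance matrix, which is one reason the method can degrade on short, noisy series.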
2.3. MAPES Method
Irregular sampling can be treated as a case of missing data as long as the sampling time tags share a greatest common divisor. This constraint is satisfied in most biological experiments and published data sets. The missing-data amplitude and phase estimation (MAPES) method is a nonparametric spectral estimation approach. It is robust to modeling errors, it deals with arbitrary data-missing patterns as opposed to gapped or periodically gapped data, and it achieves a better spectral resolution in the sense of resolving closely spaced spectral lines. However, the exploitation of the expectation maximization (EM) algorithm sacrifices its computational efficiency.
where represents the complex amplitude of the sinusoidal component and denotes the residual term. The probing frequencies still follow (6). Employing the EM algorithm, MAPES tries to iteratively assess the missing data, and meanwhile to update the estimation of spectra and error.
In (19), are subblock matrices located on the main diagonal of matrix .
Actually, in our in silico experiments, assuming , MAPES requires about two orders of magnitude more computational time (roughly one hundred times slower) than the Lomb-Scargle and Capon methods to yield an estimate of the power spectrum. Also, the simulation results do not indicate any performance improvement for MAPES in terms of the ability to discover published cell cycle genes. A more detailed comparison between these schemes will be presented in the simulation section.
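The preprocessing idea stated at the start of this subsection, namely, placing the observed time tags on a regular grid whose step is the greatest common divisor of the sampling intervals and marking the remaining grid points as missing, can be sketched as follows. The function name `to_regular_grid` and the time tags in the usage note are hypothetical; the sketch assumes sorted integer time tags with at least two samples, and it does not implement the EM iterations of MAPES itself.

```python
from functools import reduce
from math import gcd

import numpy as np

def to_regular_grid(times, values):
    """Embed irregular samples in a regular grid with missing entries.

    times  : sorted integer time tags (for example, minutes); at least two
    values : measurements taken at those times
    Returns the grid step, the gridded values (NaN where missing), and a
    boolean mask that is True where a sample was observed.
    """
    times = [int(t) for t in times]
    start = times[0]
    # The gcd of all offsets from the first tag equals the gcd of all
    # intervals between consecutive tags.
    step = reduce(gcd, (t - start for t in times[1:]))
    n_grid = (times[-1] - start) // step + 1
    grid_values = np.full(n_grid, np.nan)
    observed = np.zeros(n_grid, dtype=bool)
    for t, v in zip(times, values):
        idx = (t - start) // step
        grid_values[idx] = v
        observed[idx] = True
    return step, grid_values, observed
```

For instance, hypothetical time tags 10, 30, 50, 60, and 90 would give a step of 10 and a grid of nine points, four of which are treated as missing.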
2.4. Periodicity Test
A rejection of the null hypothesis based on a p-value threshold implies that the power spectral density contains a frequency with magnitude substantially greater than the average value. This indicates that the time series data contain a periodic signal and the corresponding gene is cyclic in expression. Notice also that a more accurate estimation method for the p-values can be found in Fisher or Brockwell and Davis. The rank of genes ordered by their p-values is of additional importance, and it helps to hedge the risk of dichotomous decisions.
For the Lomb-Scargle periodogram, is exponentially distributed under the null hypothesis , a result which is also exploited in . However, this exponential distribution is not applicable for a general power spectral density. Therefore, Fisher's test is employed to perform the comparison among different spectral schemes. Our simulation results also show that for Lomb-Scargle periodogram, the gene ranks generated by Fisher's test do not differ much from that produced by the exponential distribution. Finally, we remark that other periodicity detection tests exist, as indicated by the robust Fisher test , the likelihood ratio test, and the test .
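To make the test concrete, the sketch below computes Fisher's g-statistic (the largest spectral ordinate divided by the sum of all ordinates) together with the p-value from Fisher's exact finite-sum formula. This is the textbook form, which holds for periodogram ordinates at Fourier frequencies under a Gaussian white noise null; it is applied to other spectral estimates only as a heuristic, and the p-value expression used in this study may be an approximation of it. The function name `fisher_g_test` is an assumption for this example.

```python
import math

import numpy as np

def fisher_g_test(spectrum):
    """Fisher's g-statistic and its exact p-value (illustrative sketch).

    spectrum : nonnegative spectral ordinates at the probing frequencies
    """
    s = np.asarray(spectrum, dtype=float)
    n = len(s)
    g = s.max() / s.sum()
    upper = min(n, int(math.floor(1.0 / g)))
    p = 0.0
    for j in range(1, upper + 1):
        # Alternating inclusion-exclusion terms of Fisher's formula.
        base = max(1.0 - j * g, 0.0)
        p += (-1) ** (j - 1) * math.comb(n, j) * base ** (n - 1)
    # Clip to [0, 1] to absorb floating-point round-off.
    return g, min(max(p, 0.0), 1.0)
```

Because the sum alternates in sign with large binomial coefficients, it can lose precision when the number of ordinates is large; for the short frequency grids arising from microarray time series this is usually not a concern.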
2.5. Multiple Testing Correction
where the numerator is an estimate of the number of false positives. Since generally periodic genes only occupy a small portion of all genes, the is set to directly in our simulation. Such an action brings a slightly larger estimate. There exist other statistical methods to estimate , for example, .
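A minimal sketch of a q-value computation in the spirit of Storey's direct approach is given below. It assumes that the proportion of truly null (nonperiodic) genes is taken to be 1, which makes the estimate conservative and equivalent to Benjamini-Hochberg adjusted p-values. The function name `q_values` and the parameter name `pi0` are assumptions for this example.

```python
import numpy as np

def q_values(p_values, pi0=1.0):
    """q-values from p-values (illustrative sketch).

    p_values : one p-value per gene
    pi0      : assumed proportion of true null hypotheses
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    # Estimated false positives (pi0 * m * p) divided by the number of
    # genes called significant at each threshold.
    raw = pi0 * m * p[order] / ranks
    # Enforce monotonicity: q(p_i) is the minimum over all thresholds >= p_i.
    q_sorted = np.minimum.accumulate(raw[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q
```

Genes can then be retained by thresholding the returned values, or simply ranked by them.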
3. Simulation Results
Our in silico experiments are first performed on the Saccharomyces cerevisiae (budding yeast) data set. The Lomb-Scargle, Capon, and MAPES are compared. Then we proceed to analyze the Drosophila melanogaster (fruit fly) data set.
3.1. Simulation on Saccharomyces Cerevisiae
The performance of the three schemes is evaluated based on the Saccharomyces cerevisiae (budding yeast) data set reported by Spellman et al. In the biological experiments, the mRNA concentrations of more than 6000 open reading frames (ORFs) were measured for the yeast strains synchronized by using four different methods, namely, α factor, cdc15, cdc28, and elutriation. The data set contained 73 sampling points, while there existed missing observations for some genes.
The literature has provided prior knowledge about the yeast cell cycle genes: Spellman et al.  enumerated 104 cell cycle genes that were verified in previous biological experiments, while Lichtenberg et al.  summarized 105 genes that were not involved in the cell cycle. By exploiting these two control sources, we can evaluate the true and false positives generated by the three spectral estimation methods.
The comparison procedure is as follows: based on the given data set, each of the three schemes is run to retain a prespecified number of genes. These genes are marked as cell cycle genes and are compared with the two control gene sets, from which the number of positives is counted. If a retained gene also exists in the gene set which has been verified to be cell cycle regulated, this hit is counted as a true positive. On the other hand, if the retained gene appears in the gene set which has been corroborated to be not involved in the cell cycle, this hit is counted as a false positive. Notice that we expect the non-cell-cycle genes to be the majority of all measured genes, whereas the verified non-cell-cycle genes are only a small portion of all the genes. Hence, the false positives counted from verified non-cell-cycle genes provide only a reference, not a complete picture of the false positives. Because the three algorithms perform similarly for all four data sets, only simulation outcomes for cdc15 are presented here to exemplify the general results. The cdc15 data set contained 24 time points sampled from minutes to minutes. The greatest common divisor (gcd) for all time intervals is minutes. Therefore and . The bandwidth of Capon method is 14 while the subvector length of MAPES is equal to . All three schemes, that is, Lomb-Scargle, Capon, and MAPES, are applied on the data set.
Overall, the Lomb-Scargle scheme always identifies the largest number of cell cycle genes that have been verified in previous biological experiments. Given this result and its simplicity, we recommend the use of this method.
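The counting of true and false positives described in this subsection can be sketched as follows. The function name `count_hits`, the dictionary-based input, and the gene identifiers in the usage note are hypothetical and serve only to illustrate the procedure.

```python
def count_hits(p_values, known_cyclic, known_noncyclic, top_k):
    """Count true and false positives among the top-ranked genes.

    p_values        : dict mapping gene identifier to its p-value
    known_cyclic    : genes verified to be cell cycle regulated
    known_noncyclic : genes verified not to be involved in the cell cycle
    top_k           : prespecified number of genes to retain
    """
    # Rank genes by increasing p-value and retain the top_k of them.
    ranked = sorted(p_values, key=p_values.get)
    retained = set(ranked[:top_k])
    true_positives = len(retained & set(known_cyclic))
    false_positives = len(retained & set(known_noncyclic))
    return true_positives, false_positives
```

Running this for each spectral scheme with the same `top_k` and the same two control sets yields directly comparable counts; for example, `count_hits(p, cyclic, noncyclic, 500)` with hypothetical inputs `p`, `cyclic`, and `noncyclic`.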
3.2. Simulation on Drosophila Melanogaster
The Drosophila melanogaster (fruit fly) is selected as our research target because it is a well-studied, relatively simple organism with a short generation time and only 4 pairs of chromosomes. In addition, 75% of human diseases have their counterparts in fruit fly, and 50% of fruit fly proteins have their mammalian analogs . These make the fruit fly an excellent model for the research of human diseases. In the literature for the fruit fly, most of the research work was conducted through experimental biological methods, and the computational analysis tools have not been fully explored for the detection of periodically expressed genes. Our in silico experiments are performed on the fruit fly data set published by Arbeitman et al. . With the usage of cDNA microarrays, the RNA expression levels of 4028 genes were measured. These stand for about one third of all found fruit fly genes.
In Arbeitman's experiments, 75 sequential sampling points were observed, starting right after fertilization and through embryonic, larval, pupal, and early days of adulthood. The time series data during the embryonic stage are analyzed. The embryonic stage gives us insight into the developmental process, that is, how the fruit fly grows from a zygote to a complex organism with cell specialization. The embryonic data takes the instant of egg lay as the time origin. 30 time points were sampled from hour to hours. The greatest common divisor (gcd) for all time intervals is hour. Therefore and . The best candidate, Lomb-Scargle, is applied on the data set.
The top 149 genes with the smallest p-values are selected and inferred to be periodic with the highest confidence. To remove the effects of the DC component, the first two frequency probes are filtered out. The -value is controlled to be less than 0.2. The detailed results are organized into a spreadsheet and provided in the supplementary materials. The majority of genes are associated with a periodicity of about 20 hours, and we hypothesize that a portion of them are related to the circadian rhythm. The cell cycle genes are not fully detectable because, in the embryonic stage, the cells proliferate very fast, within minutes, whereas the implemented sampling rate was not fast enough to capture the phenomenon in the cell cycle.
3.3. Discussion of Synchronization Effects
In order to measure a valid sample, the cell culture has to be synchronized. In other words, all cells within the culture should be homogeneous in all aspects, for example, cell size, DNA, RNA, protein, and other cellular contents, and should also mimic the unperturbed cell cycle. Cooper argued that ideal synchronization is impossible because the different dimensions, like cell size and DNA content, cannot be controlled at the same time. Therefore, current popular synchronization methods, like serum starvation and thymidine block, are only one-dimensional synchronization techniques and fail to achieve a truly global synchronization. Cooper also argued that it was fully possible that the discovered periodicity was completely caused by chance or by the specific synchronization method employed. The available fruit fly data set was sampled with the synchronization yielded by the Cryonics method. Cryonics is the low-temperature preservation method of tissues in which all cell activities are believed to be halted. The cells frozen with liquid nitrogen were compared with control cells that were formaldehyde fixed, to ensure that the cells were at the expected developmental stages during sampling. This synchronization method differentiates itself from the one-dimensional methods employed in other studies, which have been shown to present cell cultures that are not actually representative of the cell cycle. Though the damage caused by the freezing was not known, the fly's development assumed true synchronization with the control cells at every developmental checkpoint. This provided enough evidence to consider Arbeitman's data set out of the scope of the issues raised by Cooper. Therefore, one can claim with confidence that any discovered periodicity will not have arisen from chance fluctuations alone.
Three of the most representative spectral analysis methods, namely, the Lomb-Scargle, Capon, and missing-data amplitude and phase estimation (MAPES) methods, are compared in terms of their performance in detecting the periodically expressed genes in Saccharomyces cerevisiae. The Lomb-Scargle and Capon methods are computationally efficient, while MAPES involves extensive matrix calculations and the iterative expectation maximization (EM) step. Our in silico experiments revealed that the simplest method, Lomb-Scargle, outperforms the more sophisticated Capon and MAPES methods: compared with the other two, the Lomb-Scargle method is able to identify more published cyclic genes. This discrepancy between methods is mainly attributed to the data features, such as the small sample size, the large proportion of missing samples, and samples highly corrupted by noise. In addition, the computational complexity incurred by MAPES for achieving high resolution is not justifiable in the context of gene microarray data. Thus, the computationally simpler methods are better suited to small sample size scenarios.
The computational results also provide novel insights into the data reported from the Drosophila melanogaster experiments. A list of 149 genes is identified as periodically expressed. Their relation with the biological processes is yet to be validated. Our future works also include the development of a comprehensive time-frequency analysis framework for time series microarray data. The small sample size represents another great challenge. Besides, a cross-species study is also desired to examine the relations between fruit fly and Homo sapiens genes.
This work is supported by the USA National Cancer Institute (CA-90301) and the National Science Foundation (ECS-0355227 and CCF-0514644).
- Spellman PT, Sherlock G, Zhang MQ, et al.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):3273-3297.
- Whitfield ML, Sherlock G, Saldanha AJ, et al.: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 2002, 13(6):1977-2000. 10.1091/mbc.02-02-0030
- Arbeitman MN, Furlong EEM, Imam F, et al.: Gene expression during the life cycle of Drosophila melanogaster. Science 2002, 297(5590):2270-2275. 10.1126/science.1072152
- Giurcaneanu CD: Stochastic complexity for the detection of periodically expressed genes. Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007 1-4.
- Ahdesmäki M, Lähdesmäki H, Pearson R, Huttunen H, Yli-Harja O: Robust detection of periodic time series measured from biological systems. BMC Bioinformatics 2005, 6, article 117: 1-18.
- Luan Y, Li H: Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics 2004, 20(3):332-339. 10.1093/bioinformatics/btg413
- Lu X, Zhang W, Qin ZS, Kwast KE, Liu JS: Statistical resynchronization and Bayesian detection of periodically expressed genes. Nucleic Acids Research 2004, 32(2):447-455. 10.1093/nar/gkh205
- Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20(1):5-20. 10.1093/bioinformatics/btg364
- Bowles T, Jakobsson A, Chambers J: Detection of cell-cyclic elements in mis-sampled gene expression data using a robust Capon estimator. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), Montreal, Canada, May 2004 5: 417-420.
- de Lichtenberg U, Jensen LJ, Fausbøll A, Jensen TS, Bork P, Brunak S: Comparison of computational methods for the identification of cell cycle-regulated genes. Bioinformatics 2005, 21(7):1164-1171. 10.1093/bioinformatics/bti093
- Stoica P, Moses RL: Introduction to Spectral Analysis. Prentice Hall, Upper Saddle River, NJ, USA; 1997.
- Lomb NR: Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science 1976, 39(2):447-462. 10.1007/BF00648343
- Scargle JD: Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 1982, 263: 835-853.
- Glynn EF, Chen J, Mushegian AR: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 2006, 22(3):310-316. 10.1093/bioinformatics/bti789
- Schwarzenberg-Czerny A: On the advantage of using analysis of variance for period search. Monthly Notices of the Royal Astronomical Society 1989, 241: 153-165.
- Ahdesmäki M, Lähdesmäki H, Gracey A, Shmulevich I, Yli-Harja O: Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data. BMC Bioinformatics 2007, 8, article 233: 1-16.
- Stoica P, Sandgren N: Spectral analysis of irregularly-sampled data: paralleling the regularly-sampled data approaches. Digital Signal Processing 2006, 16(6):712-734. 10.1016/j.dsp.2006.08.012
- Wang Y, Stoica P, Li J, Marzetta TL: Nonparametric spectral analysis with missing data via the EM algorithm. Digital Signal Processing 2005, 15(2):191-206. 10.1016/j.dsp.2004.10.004
- "Supplementary Materials", prepared in Microsoft Excel [http://www.ee.tamu.edu/~wtzhao/Research.html]
- Eyer L, Bartholdi P: Variable stars: which Nyquist frequency? Astronomy and Astrophysics Supplement Series 1999, 135(1):1-3. 10.1051/aas:1999102
- Fan J, Yao Q: Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York, NY, USA; 2003.
- Fisher RA: Tests of significance in harmonic analysis. Proceedings of the Royal Society of London. Series A 1929, 125(796):54-59. 10.1098/rspa.1929.0151
- Brockwell PJ, Davis RA: Time Series: Theory and Methods. 2nd edition. Springer, New York, NY, USA; 1987.
- Ahdesmäki M, Lähdesmäki H, Yli-Harja O: Robust Fisher's test for periodicity detection in noisy biological time series. Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), Tuusula, Finland, June 2007.
- Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B 2002, 64(3):479-498. 10.1111/1467-9868.00346
- Storey JD: The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics 2003, 31(6):2013-2035. 10.1214/aos/1074290335
- de Lichtenberg U, Wernersson R, Jensen TS, et al.: New weakly expressed cell cycle-regulated genes in yeast. Yeast 2005, 22(15):1191-1201. 10.1002/yea.1302
- Reiter LT, Potocki L, Chien S, Gribskov M, Bier E: A systematic analysis of human disease-associated gene sequences in Drosophila melanogaster. Genome Research 2001, 11(6):1114-1125. 10.1101/gr.169101
- Cooper S: Rethinking synchronization of mammalian cells for cell cycle analysis. Cellular and Molecular Life Sciences 2003, 60(6):1099-1106.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.