TRII: A Probabilistic Scoring of Drosophila melanogaster Translation Initiation Sites
 Michael P. Weir^{1} and
 Michael D. Rice^{2}
DOI: 10.1155/2010/814127
© M. P. Weir and M. D. Rice. 2010
Received: 29 April 2010
Accepted: 14 October 2010
Published: 18 October 2010
Abstract
Relative individual information is a measurement that scores the quality of DNA- and RNA-binding sites for biological machines. The development of analytical approaches to increase the power of this scoring method will improve its utility in evaluating the functions of motifs. In this study, the scoring method was applied to potential translation initiation sites in Drosophila to compute Translation Relative Individual Information (TRII) scores. The weight matrix at the core of the scoring method was optimized based on high-confidence translation initiation sites identified by using a progressive partitioning approach. Comparing the distributions of TRII scores for sites of interest with those for high-confidence translation initiation sites and random sequences provides a new methodology for assessing the quality of translation initiation sites. The optimized weight matrices can also be used to describe the consensus at translation initiation sites, providing a quantitative measure of preferred and avoided nucleotides at each position.
1. Introduction
Understanding how biological machines work in the context of genomes, transcriptomes, and proteomes requires appropriate languages and representations for successful modeling of their biological processes. Information theory provides one of the foundations for this goal and underlies sequence motif-finding algorithms such as MEME [1]. For example, information theory gives us powerful ways to analyze and score sequence motifs in RNAs that are targeted by biological machines such as the spliceosome or ribosome [2–4]. The approach reveals, for each nucleotide position in the motif, which nucleotide choices are preferred and which are avoided. For any single RNA sequence, the collective deviations from the preferred nucleotides must be sufficiently small for the machine to successfully function on that RNA.
The summation represents the uncertainty based on the frequencies of occurrence of the nucleotides at position $l$. The sampling correction factor $e(n)$ depends on the number of sequences $n$ and decreases toward 0 as the value of $n$ increases [3].
where $p(b)$ is the background frequency of nucleotide $b$ in a selected set of sequences.
where $f(b,l)$ denotes the frequency of occurrence of nucleotide $b$ at position $l$ in the reference set, and $e(n)$ denotes the sampling correction factor discussed above. In essence, the reference set is used to create a weight matrix of values which are used to calculate the individual information score based on which nucleotide is present at each position in the test sequence. The more representative the reference sequences used to construct the weight matrix, the better the dynamic range of the individual information scoring system: sequences with a good match to a motif will have higher scores, and sequences with poorer matches will have lower scores (see discussion of matrix optimization below).
where $p(b)$ is the background frequency of nucleotide $b$. For example, when relative individual information was used to score splice sites [3], background nucleotide frequencies based on the full set of cDNAs were used.
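The scoring scheme described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation (the Methods describe database stored procedures). Each weight is taken to be the log (base 2) of the position-specific nucleotide frequency relative to its background frequency, minus a sampling correction, following the verbal description of the weight matrix in the text. The function names, the `pseudocount` used to avoid taking the log of zero, and the way the correction is passed in are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGU"


def background_frequencies(utr_sequences):
    """Estimate background nucleotide frequencies from a set of sequences."""
    counts = Counter()
    for seq in utr_sequences:
        counts.update(ch for ch in seq if ch in NUCLEOTIDES)
    total = sum(counts.values())
    return {b: counts[b] / total for b in NUCLEOTIDES}


def weight_matrix(reference_sites, background, correction=0.0, pseudocount=0.5):
    """Build one {nucleotide: weight} dict per aligned position.

    weight = log2(f(b, l) / p(b)) - correction
    The pseudocount is an assumption of this sketch (it keeps f(b, l) > 0);
    the sampling correction is supplied by the caller.
    """
    n = len(reference_sites)
    length = len(reference_sites[0])
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in reference_sites)
        column = {}
        for b in NUCLEOTIDES:
            f = (counts[b] + pseudocount) / (n + pseudocount * len(NUCLEOTIDES))
            column[b] = math.log2(f / background[b]) - correction
        matrix.append(column)
    return matrix


def trii_score(site, matrix):
    """Sum the weights selected by the nucleotide present at each position."""
    return sum(column[b] for b, column in zip(site, matrix))
```

In use, `reference_sites` would be equal-length windows aligned on the annotated AUG, and a test window of the same length is scored with `trii_score(window, matrix)`. Windows containing characters outside `ACGU` would need to be filtered or converted first.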
Relative individual information scoring of individual DNA and RNA sequences has been discussed previously [7], and forms the basis for motif-finding algorithms such as MEME [1] which are based on Markov models that encapsulate the notion of individual information. In this study, we developed methods to use relative individual information to score translation initiation sites using Drosophila as a model system. When applied to translation initiation, we refer to relative individual information scores as TRII scores (Translation Relative Individual Information). As presented below, the ability to score individual sequences presents an opportunity to analyze distributions of TRII scores for sets of sequences of interest. By appropriate choices of control test TRII score distributions, this approach allows one to interpret score distributions for sites of interest in a probabilistic manner. Analysis of score distributions provides insights into translation initiation: potential initiation sites with TRII scores that resemble high-confidence start sites can be considered likely initiation sites whereas sites similar to random sequences are likely to be weak or nonfunctional for translation initiation. We also discuss how the methods described in this paper can be applied to the initiation context scoring method of Miyasaka [8] which has been used, for example, to predict and score translation initiation sites in a recent ribosome profiling study based on deep sequence analysis in yeast [9]. In contrast to TRII scoring, which measures deviations from background frequencies at each nucleotide position (4), the Miyasaka method is based on deviations from the preferred nucleotide at each position.
2. Results and Discussion
2.1. Identification of High-Confidence Translation Initiation Sites
An initial goal of this analysis was to define sets of high-confidence translation start sites whose TRII score distributions could be used as standards for analysis of TRII score distributions of other test sets. Previous studies have tended to rely on "curated" gene sets to define training sets of high-confidence translation initiation sites. Instead, we developed a bioinformatics approach to identify large sets of initiation sites in which we could have high confidence.
We hypothesized that the depressed relative information levels at annAUGs associated with upAUGs might be explained by the presence of annAUGs that are weak or nonfunctional translation initiation sites. For example, weak or nonfunctional annAUG sites might be expected if there is translation initiation at upAUGs followed by translation reinitiation [14–16] at annAUGs or downstream AUGs. To investigate this further, the distributions of relative individual information scores were examined for subsets of cDNAs with different numbers of upAUGs. We assessed whether the subsets of cDNAs with different numbers of upAUGs were essentially a mixture of two classes of annAUGs: (i) higher-scoring, likely functional translation start sites and (ii) lower-scoring, weak, or nonfunctional start sites.
The translation relative individual information (TRII) scores were calculated using a reference set, which we define as the set of cDNAs whose 5′UTRs contain at least 200 nucleotides (denoted 5′UTR ≥ 200; see Supplementary Table 6 for a summary of the sequence sets used in this study, available online at doi:10.1155/2010/814127). Because ribosomes are hypothesized to scan 5′UTRs to identify translation initiation sites, we used the nucleotide frequencies in the 5′UTRs of a set of 8,607 cDNAs as background frequencies. The weight matrix is based on these background frequencies and nucleotide positions −20 to 20 relative to the annAUGs in the reference set. This range of positions is used throughout the paper to define weight matrices and to score test sequences.
Table 1: UpAUG analysis.
Number of upAUGs  Number of cDNAs  Random curve (%)  0upAUG curve (%)
1                 502              6                 94
2 or 3            812              13                87
4 or 5            695              24                76
6 to 9            487              31                69
≥10               687              51                49
The relative individual information distribution for the 0upAUG set suggests it has the least contamination with weak or nonfunctional annAUGs, compared to sets of cDNAs with upAUGs in their UTRs (Figure 2 and data not shown). We conclude that identification of 0upAUG sets provides a convenient informatics-based method for computing sets of high-confidence translation initiation sites.
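The partitioning of cDNAs by upAUG count can be sketched as follows. This is an illustrative sketch only: it assumes that every AUG triplet in the 5′UTR counts as an upAUG regardless of reading frame (the text here does not state the counting rule), and the function names and the `(identifier, utr5)` input format are hypothetical. The bin labels mirror the rows of the upAUG analysis table, with an added bin for cDNAs that have no upAUGs.

```python
def count_upstream_augs(utr5):
    """Count AUG triplets at any offset in a 5'UTR (DNA or RNA alphabet)."""
    seq = utr5.upper().replace("T", "U")
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == "AUG")


def upaug_bin(n):
    """Map an upAUG count to a bin label."""
    if n <= 1:
        return str(n)
    if n <= 3:
        return "2 or 3"
    if n <= 5:
        return "4 or 5"
    if n <= 9:
        return "6 to 9"
    return ">=10"


def partition_by_upaug(cdnas, min_utr_length=200):
    """Group cDNA identifiers by upAUG bin, keeping only long enough 5'UTRs.

    cdnas: iterable of (identifier, utr5_sequence) pairs.
    """
    bins = {"0": [], "1": [], "2 or 3": [], "4 or 5": [], "6 to 9": [], ">=10": []}
    for ident, utr5 in cdnas:
        if len(utr5) < min_utr_length:
            continue  # skip cDNAs whose 5'UTR is shorter than the cutoff
        bins[upaug_bin(count_upstream_augs(utr5))].append(ident)
    return bins
```

The `"0"` bin returned by `partition_by_upaug` corresponds to a 0upAUG set under these assumptions.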
2.2. Optimizing the Choice of the Reference Set
These sets of high-confidence translation initiation sites were used to improve the TRII scoring approach in two ways: (i) to modify the weight matrices that underpin the TRII scoring method, and (ii) to provide control test score distributions for assessment of scores. We first discuss optimization of the weight matrix. Up to this point, we have used the full set of cDNAs with 5′UTR ≥ 200 as a reference set to construct the weight matrix for computing relative individual information scores. Because the 0upAUG set consisting of 446 sequences appears to have the least contamination with weak or nonfunctional annAUGs, we explored using it instead as an optimized high-confidence reference set. Henceforth, we distinguish two 0upAUG sets: those with 5′UTRs ≥ 200 and those with 5′UTRs between 100 and 199 nucleotides.
The use of 0upAUG reference sets is supported by our testing of the TRII score method in budding yeast (Supplementary Figures 5 and 6). Protein expression and ribosome densities have been measured for most yeast genes [17, 18]. For highly expressed genes, we observed a correlation between TRII scores and protein expression levels or ribosome densities, and these correlations were stronger when a 0upAUG reference set was used to compute the TRII scores (see Supplementary Material S.6).
2.3. Validating Control Test Distributions
Using the improved weight matrices, we assessed the effectiveness of using score distributions of 0upAUG sets as control test distributions for analysis of TRII scores. Comparisons of 0upAUG distributions with distributions for sets of translation initiation sites from the Drosophila genome project support the use of 0upAUG sets as representative of functional initiation sites. The Berkeley Drosophila Genome Project (BDGP) cDNA sequence set was constructed by sequencing high-quality, full-length cDNA libraries. The annotated ORFs and annAUGs were determined by finding the longest ORF encoded by each cDNA. The sequenced cDNAs (copies of mRNAs), which are part of the Drosophila Genome Project, can be compared with the set of annotated genes and their transcripts that has been assembled based initially on gene prediction algorithms. A subset of the cDNA ORFs that matched ORFs of annotated transcripts in the Release 3 Drosophila genome was designated by BDGP as a "Gold collection" [11]. Gold collection ORFs were considered to be high-quality because they were both predicted in the genome and found in cDNAs. Comparison of the TRII score distributions for the full Gold collection of cDNAs with 5′UTR ≥ 200 (red curve, Figure 4(a)) and the full set of Release 5.9 predicted genes with 5′UTR ≥ 200 (green curve) reveals strikingly similar distributions. This is consistent with Gold collection cDNAs being viewed as representative of current annotated gene models. The TRII score distributions for the Gold collection and Release 5.9 predicted genes are both similar to the score distribution for the 0upAUG set of cDNAs (blue curve), except that both have slightly greater frequencies of low-scoring start sites. We partitioned the Gold set cDNAs with 5′UTR ≥ 200 into two test subsets: those with no upAUGs, and those with 1 or more upAUGs.
The 300 0upAUG cDNAs in the Gold set have a distribution of TRII scores that is very similar to the distribution of the scores obtained using the 0upAUG reference set itself as a test set (red and blue curves, respectively, Figure 4(b)). These observations support the conclusion that the 0upAUG annAUGs represent a high-confidence set of translation initiation sites and that various sets of 0upAUG sites are appropriate to use for control test curves of TRII scores.
In this analysis, we noticed a disparity between TRII score distributions for experimentally observed cDNAs not in the Gold collection compared to Gold collection cDNAs that match predicted transcripts. TRII score distributions were compared using chi-square goodness-of-fit tests (Supplementary Material S.2.1). Various subsets of these "non-Gold" cDNAs (Figure 4) with at least one upAUG showed many more low-scoring annAUGs than their Gold counterparts, even though the non-Gold cDNAs appear to represent authentic mRNAs (see Figure 4 legend). The fact that non-Gold cDNAs represent mRNAs not in the predicted transcriptome suggests that the algorithms used to predict the Drosophila transcriptome prior to incorporation of cDNA data were conservative and failed to predict significant numbers of experimentally observed transcripts, including mRNAs with upAUGs and low-scoring annAUGs.
2.4. Applications of Optimized TRII Scoring
We assessed the optimized TRII scoring method by analyzing the distributions of several special sets of interest in order to (1) assess upstream AUGs through comparisons with control distributions, and (2) assess nonconserved annAUGs using linear combinations of control curves.
2.4.1. Upstream AUGs
As noted previously, many cDNAs have upAUGs in their 5′UTRs. We examined the TRII score distribution for the set of first AUGs upstream of the annAUG in Gold collection cDNAs containing upAUGs (with 5′UTR ≥ 200). The distribution of TRII scores (green curve, Figure 5) was very similar to the random AUG set distribution (grey curve), suggesting that the upAUGs are generally weak or nonfunctional translation initiation sites.
Nucleotide position −3 plays a central role in defining the consensus motif for translation initiation in Drosophila (see the final section on defining motifs). We observed that 57.6% of the upAUGs have C or U at this position, in contrast to only 7.6% of the annAUGs in the 0upAUG set. Given that 47.5% of random sequences have C or U at this position (consistent with the background frequencies in 5′UTRs of 22.6% and 24.8% for C and U, respectively), this suggests that there may be some selection in favor of C or U at this position to reduce the likelihood of translation initiation at upAUGs. These observations suggest that the random sequence set is an appropriate comparison set to represent weak or nonfunctional AUGs in analysis of TRII score distributions.
2.4.2. Nonconserved annAUGs
The TRII score distributions for the 0upAUG set of cDNAs and for the set of random sequences provide useful control test curves for assessing special sets of annAUGs. Linear combination of these control curves can be useful in cases where experimental distributions are intermediate between them. For example, we measured TRII scores for a set of annAUGs considered highly likely to be misannotated (red curve, Figure 6). These suspect annAUGs were marked for reannotation (Lin and Kellis, personal communication [19–21]) because their annAUG and downstream codons are not well conserved in 11 other Drosophila species that have been sequenced. The TRII score distribution for the suspect Drosophila melanogaster annAUGs was compared with the score distributions for the 0upAUG and random sequence sets. The relative individual information scores were calculated using the optimized reference set.
As illustrated in Figure 6, the score distribution of the suspect set of annAUGs shows some similarity to the distribution for random sequences surrounding the AUG. This supports the conclusion that many of the suspect annAUGs are either weak or nonfunctional translation initiation sites.
In order to estimate the fraction of suspect annAUGs with randomlike sequence context, we used a curve reconstruction approach. We compared the observed TRII score distribution of the suspect set (Figure 6, red curve) to a composite distribution (green curve) derived from the 0upAUG (blue) and random (grey) curves combined in a ratio of 0.31 : 0.69. This ratio was chosen to minimize the sum of squares of differences between the corresponding values in the test (red) and composite (green) curves. Our analysis suggests that approximately 70% of the suspect annAUGs are misannotated or underannotated and about 30% are not misannotated. Therefore, while the majority of genes are correctly reannotated, some nonconserved annAUGs might be reannotated inappropriately based upon conservation assessment. This analysis illustrates the potential utility of reconstructing TRII score distributions as a linear combination of distributions for highconfidence (0upAUG) and random sequences.
2.5. Estimating Confidence Intervals Using TRII Scores
Table 2: Score thresholds.
                          .05    .10    .50    .90    .95
TRII threshold_{random}   −1.67  −0.56  3.19   6.82   7.75
TRII threshold_{0upAUG}   3.71   4.89   8.40   10.74  11.27
Table 3: Conditional probabilities for classification.
(a)
s     P(start)
≤−5   .00
−4    .00
−3    .00
−2    .00
−1    .01
0     .02
1     .02
2     .02
3     .04
4     .07
5     .15
6     .25
7     .36
8     .49
9     .66
10    .82
11    .92
12    .97
≥13   1.00
(b)
s     P(random)
≤−5   1.00
−4    .99
−3    .98
−2    .94
−1    .90
0     .82
1     .72
2     .60
3     .46
4     .33
5     .21
6     .12
7     .06
8     .03
9     .01
10    .00
11    .00
12    .00
≥13   .00
In our analysis above of annAUGs that were flagged as possibly misannotated due to poor conservation across species (Figure 6), 40% of the suspect annAUGs had scores below 3.7 bits, and only 19% had scores above 7.7 bits. The remaining 41% of the suspect annAUGs had scores in the confidence interval between these thresholds.
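A three-way classification around these two thresholds can be sketched as follows. The sketch assumes that the lower threshold is the .05 entry for the 0upAUG row and the upper threshold is the .95 entry for the random row of the score thresholds table; the class labels and function names are hypothetical and chosen only for illustration.

```python
LOW_THRESHOLD = 3.71   # .05 entry, 0upAUG row of the score thresholds table
HIGH_THRESHOLD = 7.75  # .95 entry, random row of the score thresholds table


def classify_trii(score):
    """Place a TRII score (in bits) into one of three illustrative classes."""
    if score < LOW_THRESHOLD:
        return "random-like"
    if score > HIGH_THRESHOLD:
        return "start-like"
    return "intermediate"


def class_fractions(scores):
    """Fraction of a score list falling into each class."""
    counts = {"random-like": 0, "intermediate": 0, "start-like": 0}
    for s in scores:
        counts[classify_trii(s)] += 1
    return {label: c / len(scores) for label, c in counts.items()}
```

Applied to a list of scores for a set of interest, `class_fractions` yields the kind of three-way breakdown reported above for the suspect annAUGs.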
The weight matrix used to calculate the TRII scores is provided in Supplementary Material S.3 and may be used to calculate scores for any AUG of interest. The TRII scores can also be calculated using a graphical user interface found at http://igs.wesleyan.edu/ > Databases and Tools > Information Theoretic Analysis (see Methods). The set of reference sequences used to construct the weight matrix is provided in Supplementary Material S.1. The TRII scores for annAUGs of all predicted transcripts in the Release 5.9 Drosophila melanogaster genome are also provided in Supplementary Material S.1.
In Table 3(a), we extend the analysis presented in Table 2 and Figure 7 to estimate the conditional probabilities, based on the distribution of TRII scores for the high-confidence 0upAUG set, that a test sequence is a start site if it has a given TRII score or lower. Similarly, in Table 3(b), we estimate the conditional probabilities that a test sequence is random, and therefore weak or nonfunctional, if it has a given TRII score or higher. The latter conditional probabilities are based on the distribution of TRII scores for the random sequence set. Tables 3(a) and 3(b) provide a convenient summary for interpreting the TRII scores in Supplementary Material S.1.
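One way such cumulative columns might be tabulated from a list of scores is sketched below. This assumes that each table entry is the fraction of a control set's scores at or below (or at or above) the threshold s; the binning actually used (for example, whether scores were rounded to integer bins) is not specified in this text, and the function names are hypothetical.

```python
def fraction_at_or_below(scores, s):
    """Fraction of scores that are less than or equal to s."""
    return sum(1 for x in scores if x <= s) / len(scores)


def fraction_at_or_above(scores, s):
    """Fraction of scores that are greater than or equal to s."""
    return sum(1 for x in scores if x >= s) / len(scores)


def cumulative_table(scores, thresholds, at_or_above=False):
    """Return (threshold, cumulative fraction) pairs for a control score set."""
    fn = fraction_at_or_above if at_or_above else fraction_at_or_below
    return [(s, fn(scores, s)) for s in thresholds]
```

For example, `cumulative_table(control_scores, range(-4, 13))` would give an "at or below" column over integer thresholds, and `at_or_above=True` gives the complementary direction.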
The significant overlap in the TRII score distributions for random sequences and high-confidence initiation sites makes it necessary to treat intermediate TRII scores probabilistically as discussed above. Even though the distributions overlap, the TRII score measure can contribute to future algorithms for assessment of translation initiation in combination with other classifiers that incorporate properties such as RNA structure prediction [22] and sequence conservation [20].
The methods discussed to optimize TRII scoring (the utilization of high-confidence sets and probabilistic analysis of score distributions) can also be applied to the initiation context scoring method of Miyasaka [8]. The latter method has been used, for example, to predict and score translation initiation sites in a recent ribosome profiling study based on deep sequence analysis in yeast [9]. The Miyasaka method differs significantly from the TRII scoring approach since it uses a weight matrix of nucleotide frequency ratios computed relative to the frequency of the single most abundant nucleotide at each position. In contrast, each weight matrix entry for TRII scoring is the log of the nucleotide frequency at a position relative to the background frequency for that nucleotide (4). Both scoring methods give analogous score distributions for the 0upAUG and random sequence sets, allowing probabilistic assessment of scores (data not shown). However, the TRII scoring method has the advantage that it measures more transparently the deviations from background nucleotide frequencies that have been selected during evolution of functional sites.
2.6. Defining Motifs Using a Consensus Matrix
where denotes " and not and not ". Using this approach, a weight ≥0.5 indicates that $f(b,l)/p(b) \geq 2^{0.5} \approx 1.41$ and a weight ≤−0.5 indicates that $f(b,l)/p(b) \leq 2^{-0.5} \approx 0.71$. Hence, the "consensus" that is defined represents nucleotides whose frequencies are at least 1.41-fold higher than their background frequency. Similarly, the "not N" consensus choices have frequencies that are at least 1.41-fold lower than background. Defining the consensus measure based on deviations from background frequencies provides a natural indication of the nucleotide preferences of the translation machinery. Indeed, the most pronounced deviations are for the two pyrimidines at position −3 (6.5- and 17.7-fold lower than background), indicating that the presence of either of these pyrimidine nucleotides at this position is particularly deleterious, and that their exclusion is one of the key hallmarks of a functional translation initiation site.
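The thresholding rule for reading a consensus off one weight matrix column can be sketched as follows. This is an illustrative sketch: it assumes weights are log (base 2) ratios of observed to background frequency, so that a weight of 0.5 corresponds to a fold change of 2^0.5 ≈ 1.41. The function names and the example column values are hypothetical and are not taken from the paper's matrix.

```python
def fold_change(weight):
    """Convert a log2 weight into a fold change relative to background."""
    return 2.0 ** weight


def consensus_calls(column, threshold=0.5):
    """Split one weight-matrix column into preferred and avoided nucleotides.

    column: dict mapping nucleotide -> weight (log2 frequency ratio).
    Returns (preferred, avoided) as alphabetically sorted lists.
    """
    preferred = sorted(b for b, w in column.items() if w >= threshold)
    avoided = sorted(b for b, w in column.items() if w <= -threshold)
    return preferred, avoided


# Hypothetical column, for illustration only.
example_column = {"A": 0.9, "C": -1.3, "G": 0.2, "U": -0.6}
preferred, avoided = consensus_calls(example_column)
```

Nucleotides in neither list (here `G`) lie within a factor of about 1.41 of their background frequency and would not contribute to the consensus at that position.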
Examining the region downstream of nucleotide position 5 reveals that relative information values are elevated at positions 6, 9, 15, and 18. As discussed previously [30, 31], a 3-base periodicity is characteristic of open reading frames. Relative information is elevated at each of these positions because the frequency of one nucleotide is depressed and the frequencies of two others are elevated (see Figure 9 position 6, Figure 8, and Supplementary Tables 3 and 4). The periodic elevation of relative information and the corresponding weights indicate that these positions contribute positively to the translation-start relative individual information (TRII) scores. Indeed, if TRII scores are calculated using positions −20 to 40 (data not shown), the distribution of scores is shifted to the right, and the scoring is better able to distinguish between the 0upAUG control test set and sets of putative nonfunctional start sites (e.g., the set in Figure 6 discussed above). Statistical analysis of weight matrices is described in Supplementary Material S.3 and Supplementary Table 2.
Note that each expression $\log_2(f(b,l)/p(b))$ represents the log of the probability that a given nucleotide will occur relative to its background probability, and the summing of these log terms represents the product of these probabilities, which is the overall probability of a given individual sequence (the TRII score without a sampling correction). Hence, the weight matrix captures the essence of the consensus notion from a probability perspective.
Using a weight matrix to represent a consensus sequence is a natural extension of Schneider and colleagues' use of the weight matrix for sequence walkers [32–34]. The positional weight matrix (Figure 9) provides a fuller view of the consensus than the sequence logo format (Figure 8(c)) which is commonly used to represent a sequence consensus. Unlike a sequence logo, the positional weight matrix explicitly conveys deviations from background frequencies showing when nucleotides are underrepresented (negative matrix entries) or overrepresented (positive entries).
3. Conclusions
A TRII scoring method based on high-confidence translation initiation sites has been developed to assess translation initiation sites. The 0upAUG high-confidence sets are used to compute the TRII scoring weight matrix as well as to provide control test curves which, in addition to random sequence score distributions, allow for probabilistic assessment of individual TRII scores. In addition, comparison with control test curves gives powerful methods to analyze TRII score distributions for groups of translation initiation sites of special interest. The 0upAUG high-confidence sets also provide improved quantitative descriptions of the consensus motif for translation initiation in Drosophila. TRII score analysis of cDNAs containing upAUGs suggests that further experimental analysis of this class of cDNAs is warranted to assess their annotated translation initiation sites.
4. Methods
4.1. Translation Relative Individual Information (TRII) Scoring
The collections of genomic and cDNA sequences were stored in a relational database. The database schema is illustrated in Supplementary Figure 4. Information-theoretic calculations were performed using a variety of stored procedures in the database. A listing of the control test set of 0upAUG start sites at positions −20 to 20 in sequences with 5′UTRs ≥ 200, together with their relative individual information (TRII) scores, is provided in Supplementary Material S.1.2. These TRII scores are based on the optimized reference set.
where the sampling correction $e(n)$ was estimated as described previously [3, 4] assuming background frequencies of 0.25 for each nucleotide. In particular, we used the theoretical estimate of for . If the actual 5′UTR background frequencies are used to estimate $e(n)$, the value increases by less than 0.00003 for .
4.2. Reconstruction of TRII Score Distributions
We estimated the fraction of AUG sites in a test set that were similar to optimized translation initiation sites and therefore likely to be functional (see, e.g., Figure 6) as follows: given $0 \le \alpha \le 1$, construct a new distribution using the values $\alpha D_1(s) + (1-\alpha) D_2(s)$, where $D_1$ and $D_2$ denote two TRII score distributions, and $s$ represents an individual score (of a bin). Then choose the fraction $\alpha$ that minimizes the sum of the differences squared between these values and the values $T(s)$ of the actual test set distribution $T$. For our computations, the distribution was based on the scores for and was based on the scores for (Table 1) or (Figure 7).
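The least-squares choice of the mixing fraction has a closed form, sketched below. This is a minimal sketch, not the authors' code: the three distributions are assumed to be binned on the same score bins, and the function and argument names are hypothetical. Writing the composite as fraction × first + (1 − fraction) × second, setting the derivative of the summed squared error to zero gives fraction = Σ(test − second)(first − second) / Σ(first − second)², which is then clipped to the interval [0, 1].

```python
def fit_mixture_fraction(test, first, second):
    """Least-squares fraction of `first` in a two-component mixture.

    test, first, second: equal-length sequences of per-bin frequencies.
    Returns the fraction in [0, 1] minimizing
    sum((fraction * first + (1 - fraction) * second - test) ** 2).
    """
    numerator = sum((t - b) * (a - b) for t, a, b in zip(test, first, second))
    denominator = sum((a - b) ** 2 for a, b in zip(first, second))
    if denominator == 0:
        raise ValueError("component distributions are identical; fraction is undefined")
    # The objective is a convex quadratic, so clipping gives the constrained minimum.
    return min(1.0, max(0.0, numerator / denominator))


def composite(fraction, first, second):
    """Build the composite distribution for a given mixing fraction."""
    return [fraction * a + (1.0 - fraction) * b for a, b in zip(first, second)]
```

With a high-confidence distribution as `first` and a random-sequence distribution as `second`, the returned value plays the role of the fraction of start-like sites, and `composite` regenerates the fitted curve for comparison with the test distribution.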
4.3. Information Calculator
We provide a web interface for performing calculations on sets of input aligned sequences (http://igs.wesleyan.edu/ > Databases and Tools). The interface generates a weight matrix from the aligned sequences so that relative information values and relative individual information scores can be calculated for sequences of interest. The interface can be used to assess potential translation initiation sites, or other kinds of motifs for which sets of aligned sequences with the motif are available.
Supplementary Material
List of Abbreviations
TRII: Translation relative individual information
ORF: Open reading frame
BDGP: Berkeley Drosophila Genome Project
upAUG: Upstream AUG
annAUG: Annotated AUG
UTR: Untranslated region
Declarations
Acknowledgments
The authors thank Robert Lane, William Gladstone, Laurel Appel, and Adam Robbins-Pianka for careful reading of the paper, Rob Stewart, William Gladstone, and Adam Robbins-Pianka for programming contributions, and Michael Lin and Manolis Kellis for communication of unpublished data. This work was supported in part by funds from the Howard Hughes Medical Institute to support undergraduate initiatives in the life sciences.
Authors’ Affiliations
References
[1] Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME Suite: tools for motif discovery and searching. Nucleic Acids Research 2009, 37(2):W202–W208.
[2] Stephens RM, Schneider TD: Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. Journal of Molecular Biology 1992, 228(4):1124–1136. 10.1016/0022-2836(92)90320-J
[3] Weir M, Eaton M, Rice M: Challenging the spliceosome machine. Genome Biology 2006, 7(1):R3.
[4] Weir M, Rice M: Ordered partitioning reveals extended splice-site consensus information. Genome Research 2004, 14(1):67–78.
[5] Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997, 268(1):78–94. 10.1006/jmbi.1997.0951
[6] Shannon CE, Weaver W: The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill, USA; 1949.
[7] Schneider TD, Spouge J: Information content of individual genetic sequences. Journal of Theoretical Biology 1997, 189(4):427–441. 10.1006/jtbi.1997.0540
[8] Miyasaka H: The positive relationship between codon usage bias and translation initiation AUG context in Saccharomyces cerevisiae. Yeast 1999, 15(8):633–637. 10.1002/(SICI)1097-0061(19990615)15:8<633::AID-YEA407>3.0.CO;2-O
[9] Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS: Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 2009, 324(5924):218–223. 10.1126/science.1168978
[10] BDGP: Berkeley Drosophila Genome Project, 2002.
[11] Stapleton M, Carlson J, Brokstein P, Yu C, Champe M, George R, Guarin H, Kronmiller B, Pacleb J, Park S, Wan K, Rubin GM, Celniker SE: A Drosophila full-length cDNA resource. Genome Biology 2002, 3(12):research0080.1–research0080.8. 10.1186/gb-2002-3-12-research0080
[12] Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, Yu C, Carlson J, George R, Celniker S, Rubin GM: The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Research 2002, 12(8):1294–1300. 10.1101/gr.269102
[13] Rogozin IB, Kochetov AV, Kondrashov FA, Koonin EV, Milanesi L: Presence of ATG triplets in untranslated regions of eukaryotic cDNAs correlates with a 'weak' context of the start codon. Bioinformatics 2001, 17(10):890–900. 10.1093/bioinformatics/17.10.890
[14] Hinnebusch AG, Jackson BM, Mueller PP: Evidence for regulation of reinitiation in translational control of GCN4 mRNA. Proceedings of the National Academy of Sciences of the United States of America 1988, 85(19):7279–7283. 10.1073/pnas.85.19.7279
[15] Kochetov AV: Alternative translation start sites and hidden coding potential of eukaryotic mRNAs. BioEssays 2008, 30(7):683–691. 10.1002/bies.20771
[16] Kozak M: Constraints on reinitiation of translation in mammals. Nucleic Acids Research 2001, 29(24):5226–5232. 10.1093/nar/29.24.5226
[17] Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O'Shea EK, Weissman JS: Global analysis of protein expression in yeast. Nature 2003, 425(6959):737–741. 10.1038/nature02046
[18] Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS: Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 2009, 324(5924):218–223. 10.1126/science.1168978
[19] Clark AG, Eisen MB, Smith DR, et al.: Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007, 450(7167):203–218. 10.1038/nature06341
[20] Lin MF, Carlson JW, Crosby MA, et al.: Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Research 2007, 17(12):1823–1836. 10.1101/gr.6679507
[21] Stark A, Lin MF, Kheradpour P, et al.: Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 2007, 450(7167):219–232. 10.1038/nature06340
[22] Kozak M: Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene 2005, 361(1-2):13–37.
[23] Kozak M: Initiation of translation in prokaryotes and eukaryotes. Gene 1999, 234(2):187–208. 10.1016/S0378-1119(99)00210-3
[24] Kozak M: A progress report on translational control in eukaryotes. Science's STKE 2001, 2001(71):pe1.
[25] Shultzaberger RK, Roberts LR, Lyakhov IG, Sidorov IA, Stephen AG, Fisher RJ, Schneider TD: Correlation between binding rate constants and individual information of E. coli Fis binding sites. Nucleic Acids Research 2007, 35(16):5275–5283. 10.1093/nar/gkm471
[26] Cavener DR: Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Research 1987, 15(4):1353–1361. 10.1093/nar/15.4.1353
[27] Cavener DR, Ray SC: Eukaryotic start and stop translation sites. Nucleic Acids Research 1991, 19(12):3185–3192. 10.1093/nar/19.12.3185
[28] Feng Y, Gunter LE, Organ EL, Cavener DR: Translation initiation in Drosophila melanogaster is reduced by mutations upstream of the AUG initiator codon. Molecular and Cellular Biology 1991, 11(4):2149–2153.
[29] Kozak M: An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Research 1987, 15(20):8125–8148. 10.1093/nar/15.20.8125
[30] Yin C, Yau SST: A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. Journal of Computational Biology 2005, 12(9):1153–1165. 10.1089/cmb.2005.12.1153
[31] Fickett JW: Recognition of protein coding regions in DNA sequences. Nucleic Acids Research 1982, 10(17):5303–5318. 10.1093/nar/10.17.5303
[32] Gadiraju S, Vyhlidal CA, Leeder JS, Rogan PK: Genome-wide prediction, display and refinement of binding sites with information theory-based models. BMC Bioinformatics 2003, 4:38.
[33] Schneider TD: Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences. Nucleic Acids Research 1997, 25(21):4408–4415. 10.1093/nar/25.21.4408
[34] Schneider TD: Consensus sequence Zen. Appl Bioinformatics 2002, 1(3):111–119.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.