Open Access

Optimal reference sequence selection for genome assembly using minimum description length principle

EURASIP Journal on Bioinformatics and Systems Biology 2012, 2012:18

DOI: 10.1186/1687-4153-2012-18

Received: 14 January 2012

Accepted: 11 September 2012

Published: 27 November 2012

Abstract

Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads that align to each candidate reference sequence and chooses the candidate with the highest count. This article explores the use of the minimum description length (MDL) principle and two of its variants, two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the proposed MDL-based scheme with the standard method and concludes that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. The proposed MDL scheme therefore includes the standard method of “counting the number of reads that align to the reference sequence” and additionally takes the model, the reference sequence, into account when identifying the optimal reference sequence. The proposed MDL-based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.

1 Introduction

Rissanen’s minimum description length (MDL) is an inference tool that learns regular features in the data by data compression. MDL uses “code-length” as a measure to identify the best model amongst a set of models. The model which compresses the data the most and presents the smallest code-length is considered the best model. The MDL principle stems from Occam’s razor, which states that “entities should not be multiplied beyond necessity” (http://www.cs.helsinki.fi/group/cosco/Teaching/Information/2009/lectures/lecture5a.pdf); stated otherwise, the simplest explanation is the best one [1–5]. Therefore, the MDL principle tries to find the simplest explanation (model) for the phenomenon (data).

The MDL principle has been used successfully in inferring the structure of gene regulatory networks [6–13], compression of DNA sequences [14–18], gene clustering [19–21], analysis of genes related to breast cancer [22–25] and transcription factor binding sites [26].

The article is organized as follows. Section 2 briefly discusses the variants of MDL and their application to comparative assembly. Section 3 explains the algorithm used for the purpose. Section 4 elaborates on the simulations carried out to test the proposed scheme. Section 5 discusses the results and, finally, Section 6 points out the main features of this article.

2 Methods

The relevance of MDL to genome assembly can be realized by understanding that genome assembly is an inference problem where the task at hand is to infer the novel genome from read data obtained from sequencing. Genome assembly is broadly divided into comparative assembly and de novo assembly. In comparative assembly, all reads are aligned with a closely related reference sequence. The alignment process may allow one or more mismatches between each individual read and the reference sequence, depending on the user. The alignment of all the reads creates a “Layout”, beyond which the reference sequence is not used any more. The layout helps in producing a consensus sequence, where each base in the sequence is identified by simple majority amongst the bases at that position or via some probabilistic approach. This “Alignment-Layout-Consensus” paradigm is used by genome assemblers to infer the novel genome [27–35].

Comparative assembly, therefore, is an inference problem which requires identifying a model that best describes the data. It begins by identifying a model, the “reference sequence”, most closely related to the set of reads. It then uses the set of reads to build on this model, producing a model which overfits the data, the “novel genome” [27, 28, 34, 36–41]. The task of MDL is to identify the model that best describes the data; within the comparative assembly framework this means finding the reference sequence that best describes the set of reads.

MDL presents three variants: two-part MDL, Sophisticated MDL, and MiniMax Regret [1]. Their application is briefly discussed in what follows.

2.1 Two-part MDL

Also called old-style MDL, the two-part MDL chooses the hypothesis which minimizes the sum of two components:

  A) The code-length of the hypothesis.

  B) The code-length of the data given the hypothesis.

The two-part MDL selects the hypothesis which minimizes the sum of the code-length of the hypothesis and the code-length of the data given the hypothesis [1, 42–47]. The two-part MDL fits the comparative assembly problem well. In comparative assembly, the candidate hypothesis closely related to the data is the reference sequence, whereas the data itself is the read data obtained from sequencing.

2.2 Sophisticated MDL

The two components of the two-part MDL can be further divided into three components:

  A) Encoding the model class: $l(M_i)$, where $M_i$ belongs to the set of model classes and $l(M_i)$ denotes the code-length of the model class in bits.

  B) Encoding the parameters ($\theta$) for any model $M_i$: $l_i(\theta)$.

  C) Code-length of the data given the hypothesis: $\log_2 \frac{1}{p_{\bar{\theta}}(X)}$,

where $p_{\bar{\theta}}(X)$ denotes the distribution of the data $X$ according to the model $\bar{\theta}$. The three-part code-length assessment can again be converted into a two-part code-length assessment by combining steps B and C into a single step B.

  A) Encoding the model class: $l(M_i)$, where $M_i$ belongs to any model class.

  B) Code-length of the data given the hypothesis class $M_i$: $l(M_i(X))$, where $X$ stands for any data set.

Item (B) above, i.e., the ‘length of the encoded data given the hypothesis’, is also called the “stochastic complexity” of the model. Furthermore, if the data is fixed, or if item (B) is constant, then the job reduces to minimizing $l(M_i)$, that is, reducing part (A) [1, 48–53].

2.3 MiniMax regret

MiniMax Regret relies on the minimization of the worst-case regret [49, 50, 53–59]:

$$\min_{M} \max_{X} \left[ \mathrm{loss}(M, X) - \min_{\hat{M}} \mathrm{loss}(\hat{M}, X) \right], \tag{1}$$

where $M$ can be any model, $\hat{M}$ represents the best model in the class of all models and $X$ denotes the data. The regret, $R_{M_i, X}$, is defined as

$$R_{M_i, X} = \mathrm{loss}(M_i, X) - \min_{\hat{M}} \mathrm{loss}(\hat{M}, X). \tag{2}$$

Here the loss function, $\mathrm{loss}(M_i, X)$, could be defined as the code-length of the data $X$ given the model class $M_i$. The application of Sophisticated MDL in the framework of comparative assembly will be discussed in what follows.
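As a small illustration of the regret defined above, the following Python sketch subtracts the smallest loss from every model's loss. The three loss values are the “Data given the hypothesis” code-lengths reported for the toy references in Table 2; the function name `regrets` is an illustrative choice, not something taken from the article.

```python
def regrets(losses):
    """Return the regret of every model: its loss minus the smallest loss."""
    best = min(losses.values())
    return {model: loss - best for model, loss in losses.items()}


# Code-length (bits) of the unaligned reads for the three toy references of Table 2.
losses = {"Ref. 1": 12, "Ref. 2": 42, "Ref. 3": 27}

print(regrets(losses))  # {'Ref. 1': 0, 'Ref. 2': 30, 'Ref. 3': 15}
```

The model with zero regret is the one with the smallest loss, which agrees with the Regret column of Table 2.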

2.4 Sophisticated MDL and genome assembly

In reference assisted assembly, also known as comparative assembly, a reference sequence is used to assemble a novel genome from a set of reads. Therefore, the best model is the reference sequence most closely related to the novel genome and the data at hand are the set of reads.

However, it should be pointed out that the aim is not to find a general model, rather, the aim is to find a “model that best overfits the data” since there is just one or maybe two instances of the data, based on how many runs of the experiment took place. One “run” is a technical term specifying that the genome was sequenced once and the data was obtained. The term “model that best overfits the data” can be explained using the following example.

Assume one has three reads {X, Y, Z}, each having n bases. Consider two reference sequences (L) and (M), where (L) = XXYYZZ and (M) = XYZ; each contains all three reads placed side by side. Since both models contain all three reads, the stochastic complexity of (L) and (M) is the same and both overfit the data perfectly. However, since (M) is shorter than (L), (M) is the model of choice on account of being the model that “best” overfits the data.
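The example above can be checked with a few lines of Python. This is only a sketch: the three read strings are hypothetical stand-ins for X, Y, and Z (with n = 4), and “contains the read” is taken to mean exact substring match.

```python
def contains_all(reference, reads):
    """True if every read occurs as an exact substring of the reference."""
    return all(read in reference for read in reads)


# Hypothetical reads standing in for X, Y, Z (n = 4 bases each).
X, Y, Z = "ACGT", "GGCA", "TTAG"
reads = [X, Y, Z]

L = X + X + Y + Y + Z + Z   # (L) = XXYYZZ
M = X + Y + Z               # (M) = XYZ

candidates = {"L": L, "M": M}
# Both references contain every read, so the data term is identical ...
fitting = {name: seq for name, seq in candidates.items() if contains_all(seq, reads)}
# ... and the shorter model wins.
best = min(fitting, key=lambda name: len(fitting[name]))
print(best)  # M
```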

To formalize the MDL process, the first step would be to identify the following considerations:

  A) Encoding the model class: $l(M_i)$, where $M_i$ belongs to the set of model classes.

  B) Encoding the parameters ($\theta$) of the model $M_i$: $l_i(\theta)$.

  C) Code-length of the data given the hypothesis: $\log_2 \frac{1}{p_{\bar{\theta}}(D)}$.

The model class in comparative assembly would be the reference (Ref.) sequence itself. The parameters of the model, θ, are such that θ ∈ {−1, 0, 1}. In the process of encoding the model class, regions of the Ref. genome that are covered by the reads of the unassembled genome are flagged with “1”(s). Areas of the Ref. genome not covered by the reads are flagged with “0”(s), whereas areas of the Ref. genome that are inverted in the novel genome are flagged with “−1”(s). In the end, every base of the Ref. sequence is flagged with a value from {−1, 0, 1}. Therefore, the code-length of the parameters of the model is proportional to the length of the sequence.

Data given the hypothesis is typically defined as “Number of reads that align to the Ref. sequence”. In the case presented below “data given the hypothesis” is defined in an inverted fashion as the “Number of reads that do not align to the reference sequence”. These two are interchangeable as the “Total number of reads” is the sum total of the “number of reads that aligned to the Ref.” and the “number of reads that do not align to the Ref.”.

Table 1 shows that choosing the reference sequence having the highest number of reads present is not a sufficient condition for selecting the optimal reference sequence. The simulation compared two reference sequences, Fibrobacter succinogenes S85 (NC_013410.1) [60, 61] and Human Chromosome 21 (AC_000044.1) [62–64], with the reads of Pseudomonas aeruginosa PAb1 (SRX000424) [48, 65, 66]. It shows that in order to choose the optimal reference sequence one has to take into account both the “Code-length of the model” and the “Number of reads found”; together they form a sufficient condition for choosing the optimal reference sequence.
Table 1

Counting number of reads not enough

| S.No. | Reference sequence | Number of bases in genome | Number of reads found |
|---|---|---|---|
| 1 | Fibrobacter succinogenes subsp. succinogenes S85 (NC_013410.1) | 3842635 | 157 |
| 2 | Human Chromosome 21 (AC_000044.1) | 32992206 | 158 |

The table shows that choosing the reference sequence which has the highest number of reads present is not a sufficient condition. Just by looking at the “Data given the model” ≡ “Number of reads found” one ends up choosing Human Chromosome 21. However, given that Chromosome 21 is about 9× larger than S85, one realizes that S85 is actually the model of choice. Furthermore, S85 is a bacterial genome whereas Chromosome 21 comes from a eukaryotic genome. PAb1 is also a bacterium; therefore, S85 is most definitely the model of choice.

Therefore, a simple yet novel scheme is proposed for the solution to the problem, see Figure 1 and Table 2. The proposed scheme follows the three-part assessment process of Sophisticated MDL. It stores the model class (Ref. sequence), the parameters of the model (where each base of the sequence is flagged with {−1, 0, 1}) and the data given the hypothesis (reads of the novel genome that do not align to the Ref. sequence) in one file. The file is then encoded using either Huffman coding [67–70] or Shannon-Fano coding [68–71] to determine the code-length. For a simplistic three-bits-per-character coding, the code-length is measured according to Equation (3). The proposed scheme not only determines the best model amongst the pool of models to choose from, but also improves the model so that it is better suited to the novel genome to be assembled. This is done by identifying all insertions and inversions larger than one read length. The scheme then removes those insertions and rectifies those inversions to obtain a model better suited to assembling the novel genome than the one it started from, see Figures 2 and 3.
$$\text{Code length} = (\text{Length}_{\text{Ref. Seq.}} \times 3) + (\text{Length}_{\text{Parameters of the Model}} \times 3) + (\text{Length}_{\text{Read}} \times 3 \times \text{No. of Unique Unaligned Reads}). \tag{3}$$
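Equation (3) assumes a fixed three bits per character. When a variable-length code such as Huffman coding is used instead, the total code-length depends on the symbol frequencies. The sketch below computes the number of bits a Huffman code would need for a string, excluding the cost of storing the code table. How the authors tokenize the stored file (for example, whether “−1” is one symbol) is not stated in the text, so treating each character as one symbol is an assumption here, and the function name is illustrative.

```python
import heapq
from collections import Counter


def huffman_code_length(text):
    """Total bits needed to Huffman-encode `text` (code table not included)."""
    counts = list(Counter(text).values())
    if not counts:
        return 0
    if len(counts) == 1:
        return counts[0]  # a single distinct symbol still needs one bit per occurrence
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a = heapq.heappop(counts)
        b = heapq.heappop(counts)
        total += a + b  # each merge adds one bit to every symbol beneath it
        heapq.heappush(counts, a + b)
    return total


sequence = "GGGGCCCCGGGG"
print(huffman_code_length(sequence))  # 12 bits with a Huffman code
print(3 * len(sequence))              # 36 bits at three bits per character
```

A skewed symbol distribution compresses well below the fixed-width figure, while a uniform distribution over eight symbols would give the same three bits per character.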
https://static-content.springer.com/image/art%3A10.1186%2F1687-4153-2012-18/MediaObjects/13637_2012_Article_26_Fig1_HTML.jpg
Figure 1

MDL proposed scheme: The output of the system shows that the three components of the encoding scheme are separated from one another by “>”. The scheme follows the format “Model > Model given the Data > Data given the hypothesis”. In the genome assembly framework the scheme mentioned above translates into “Reference Sequence > Reference Sequence according to the set of reads > Set of reads according to the Reference sequence”. “Model given the Data” is identified using {−1, 0, 1}. “1”(s) represent the base locations where the reads are found. “0”(s) represent the locations which are not covered by any read. “−1”(s) represent the locations of the genome that are inverted.

https://static-content.springer.com/image/art%3A10.1186%2F1687-4153-2012-18/MediaObjects/13637_2012_Article_26_Fig2_HTML.jpg
Figure 2

Correcting inversions in the reference sequence. (a) Reads are derived from the novel sequence. (b) The reference sequence, $S_R$, contains two inversions, shown as yellow and blue regions. (c) The generated sequence θ has both the yellow and blue regions rectified. Notice that using a simple ad hoc scheme of counting the number of reads in the reference sequence one would have made use of (b) for assembly of the novel genome. However, using MDL one can now use (c) for the assembly of the novel genome.

https://static-content.springer.com/image/art%3A10.1186%2F1687-4153-2012-18/MediaObjects/13637_2012_Article_26_Fig3_HTML.jpg
Figure 3

Removing insertions in the reference sequence. (a) Reads are derived from the novel sequence. (b) The reference sequence, $S_R$, contains two insertions, shown as shaded grey boxes. (c) The proposed MDL process generates θ. The process removes only those insertions which are larger than τ1 but smaller than τ2, where τ1 and τ2 are user-defined. To remove the other insertion, the value of τ2 could be increased.

Table 2

Summary of the experiment using three reads {ATAT, GGGG, CCAA} and three reference sequences {1, 2, 3}

| S.No. | Ref. Seq. | Model given by the Data | Reads that do not align to the reference sequence | Data given the hypothesis: code-length (Bits) | Data given the hypothesis: regret (Bits) | Proposed scheme | Code-length (Bits) |
|---|---|---|---|---|---|---|---|
| 1 | ATAT CGGGG CTATA | 1111011110-1-1-1-1 | CCAA | 12 | 0 | ATATCGGGGCATAT>1111 0 1111 0 -1-1-1-1>CCAA | 102 |
| 2 | ATGGGCCCTTATTGC | 000000000000000 | ATAT>GGGG>CCAA | 42 | 30 | ATGGGCCCTTATTGC> 000000000000000 >ATAT>GGGG >CCAA | 138 |
| 3 | GGGGCCCCGGGG | 1111-1-1-1-11111 | ATAT>CCAA | 27 | 15 | GGGGCCCCGGGG>1111-1-1-1-11111>ATAT>CCAA | 105 |

Regret is defined as $R_{M_i, X} = \mathrm{loss}(M_i, X) - \min_{\hat{M}} \mathrm{loss}(\hat{M}, X)$. Here the loss function, $\mathrm{loss}(M_i, X)$, is the code-length of the data $X$ given the model class $M_i$, whereas “Data given the hypothesis” is the code-length of the “Reads that do not align to the reference sequence”. The code-length in the last column is measured according to Equation (3). The experiment shows that, given the MDL proposed scheme, Ref. 1 is the optimal choice for a reference sequence.
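The code-lengths in Table 2 can be re-derived with a short Python sketch. Two points are assumptions rather than statements from the article: the parameter string is taken to have one flag per reference base, and the “>” separators of the stored file are counted as symbols. The second assumption is inferred from the numbers only: applying Equation (3) literally gives 96, 126, and 96 bits, and adding three bits per separator gives the tabulated 102, 138, and 105 bits. The function names are illustrative.

```python
BITS_PER_SYMBOL = 3


def code_length_eq3(ref_len, read_len, n_unaligned):
    """Equation (3): reference + one flag per base + unique unaligned reads."""
    return BITS_PER_SYMBOL * (ref_len + ref_len + read_len * n_unaligned)


def code_length_with_separators(ref_len, read_len, n_unaligned):
    """Equation (3) plus the '>' separators of the stored file (assumption)."""
    n_separators = 1 + n_unaligned  # one before the flags, one before each unaligned read
    return code_length_eq3(ref_len, read_len, n_unaligned) + BITS_PER_SYMBOL * n_separators


# (reference length, read length, number of unaligned reads) for Ref. 1, 2, 3 of Table 2
toy_cases = {"Ref. 1": (14, 4, 1), "Ref. 2": (15, 4, 3), "Ref. 3": (12, 4, 2)}

for name, case in toy_cases.items():
    print(name, code_length_eq3(*case), code_length_with_separators(*case))
# Ref. 1 96 102
# Ref. 2 126 138
# Ref. 3 96 105
```

Under the literal Equation (3), Ref. 1 and Ref. 3 tie at 96 bits; the separator count is what places Ref. 1 ahead in the tabulated values.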

Algorithm 1 MDL analysis of a Ref. sequence given a set of reads of the unassembled genome

https://static-content.springer.com/image/art%3A10.1186%2F1687-4153-2012-18/MediaObjects/13637_2012_Article_26_Equa_HTML.gif

3 MDL algorithm

The pseudo code for analysis using Sophisticated MDL and the scheme proposed in Section 2.4 is shown in Algorithm 1. Given the reference sequence $S_R$ and a set of $K$ reads, $\{r_1, r_2, \ldots, r_K\} \in R$, obtained from the FASTQ [72, 73] file, the first step in the inference process is to filter all low-quality reads. Lines 3–10 filter out all the reads that contain the base N as well as the reads which are of low quality, leaving behind a set of $O$ reads to be used for further analysis. This pre-processing step is common to all assemblers. Once all the low-quality reads are filtered out, the remaining set of $O$ reads is sorted and then collapsed so that only unique reads remain.
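The pre-processing step can be sketched in Python as follows. The article does not state how “low quality” is decided, so the mean-Phred threshold, the Phred+33 offset, and the function names below are assumptions made for illustration only. Reads are passed in as `(sequence, quality_string)` pairs, as they would be after parsing a FASTQ file.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def preprocess_reads(records, min_mean_quality=20):
    """Drop reads containing N or of low mean quality, then sort and collapse."""
    kept = set()
    for sequence, quality in records:
        sequence = sequence.upper()
        if not sequence or not quality:
            continue
        if "N" in sequence:
            continue  # ambiguous base present
        if mean_phred(quality) < min_mean_quality:
            continue  # hypothetical quality criterion
        kept.add(sequence)  # the set collapses duplicate reads
    return sorted(kept)


records = [
    ("ACGT", "IIII"),  # kept
    ("ACGT", "IIII"),  # duplicate, collapsed
    ("ANGT", "IIII"),  # contains N
    ("GGCC", "!!!!"),  # mean quality 0
    ("TTAA", "5555"),  # mean quality 20, kept
]
print(preprocess_reads(records))  # ['ACGT', 'TTAA']
```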

Lines 13–27 describe the implementation of the proposed scheme as defined in Section 2.4. Assume that $S_R$ is $l$ bases long and that the length of each read is $p$. The window $\phi_{S_R}$ picks up $p$ bases at a time from $S_R$ and checks whether or not $\phi_{S_R}$ is present in the set of collapsed reads $R$. If $\phi_{S_R} \in R$, then the corresponding locations on $S_R$, i.e., $j \to j + p$, are flagged with “1”(s). If $\phi_{S_R} \notin R$, then invert $\phi_{S_R}$ to obtain $\psi_{S_R}$ and check whether or not $\psi_{S_R} \in R$. If yes, then mark the corresponding locations on $S_R$, i.e., $j \to j + p$, with “−1”(s) and flag $\phi_{S_R}$ to be present in $R$. Otherwise, mark the corresponding locations on $S_R$ as “0”(s).
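A minimal Python sketch of this flagging step is given below. It is not the authors' Algorithm 1: it assumes exact matching, a window that advances one base at a time, and reads of equal length. The meaning of “invert” is not spelled out in the text, so it is passed in as a function. The default, base-wise complement, is an assumption chosen because it reproduces the flags of the toy references in Table 2.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(window):
    """Base-wise complement (assumed meaning of 'invert' in this sketch)."""
    return window.translate(COMPLEMENT)


def flag_reference(reference, reads, invert=complement):
    """Flag each reference base with 1, -1 or 0 and list the unaligned reads."""
    read_set = set(reads)
    p = len(reads[0])  # all reads are assumed to have the same length p
    flags = [0] * len(reference)
    found = set()
    for j in range(len(reference) - p + 1):
        window = reference[j:j + p]
        if window in read_set:
            flags[j:j + p] = [1] * p
            found.add(window)
        else:
            inverted = invert(window)
            if inverted in read_set:
                flags[j:j + p] = [-1] * p
                found.add(inverted)
    return flags, sorted(read_set - found)


reads = ["ATAT", "GGGG", "CCAA"]
flags, unaligned = flag_reference("GGGGCCCCGGGG", reads)
print(flags)      # [1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]
print(unaligned)  # ['ATAT', 'CCAA']
```

Overlapping windows with conflicting flags simply overwrite each other here; a fuller implementation would need a rule for that case.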

Lines 28–34 generate a modified sequence θ which has all the inversions of the original sequence $S_R$ rectified. Lines 35–44 identify all insertions larger than τ1 and smaller than τ2 and remove them, see Figure 3. Here τ1 and τ2 are user-defined. Care should be taken to avoid removing very large insertions, as this may affect the overall performance in deciding the best sequence for genome assembly. Lines 45–47 remove all the reads that are present in the original $S_R$ and the modified sequence θ, identified by flags 1 and −1. In the end the code-lengths are identified by any popular encoding scheme such as Huffman [67–70] or Shannon-Fano coding [68–71]. If ξ is the smallest code-length amongst all models, then use θ as a reference for the assembly of the unassembled genome rather than using $S_R$.
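The rectification and removal steps can be sketched in Python as follows, given a reference and its per-base flags. As before, this is an illustration under assumptions rather than the authors' implementation: “invert” defaults to base-wise complement, each maximal run of −1 flags is inverted as a whole, and each maximal run of 0 flags is treated as one candidate insertion.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rectify(reference, flags, tau1, tau2,
            invert=lambda segment: segment.translate(COMPLEMENT)):
    """Build the modified sequence: undo inversions, drop mid-sized insertions."""
    pieces = []
    position = 0
    for flag, run in groupby(flags):
        length = len(list(run))
        segment = reference[position:position + length]
        position += length
        if flag == -1:
            pieces.append(invert(segment))   # rectify the inversion
        elif flag == 0 and tau1 < length < tau2:
            continue                         # remove the insertion
        else:
            pieces.append(segment)           # covered, or outside (tau1, tau2)
    return "".join(pieces)


flags = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, -1, -1, -1, -1]
print(rectify("ATATCGGGGCTATA", flags, tau1=4, tau2=100))  # ATATCGGGGCATAT
```

With τ1 set to the read length, the two single-base gaps are kept and only the inverted block changes, which gives the model string shown for Ref. 1 in the “Proposed scheme” column of Table 2.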

4 Results

Simulations were carried out on both synthetic and real data. At first, the MDL process was analyzed on synthetic data with four different sets of mutations by varying the number and length of single nucleotide polymorphisms (SNPs), inversions, insertions, and deletions. The experiments using synthetic data were carried out by generating a sequence $S_N$. The set of reads was derived from $S_N$ and sorted using the quick sort algorithm [74, 75]. Each experiment modified $S_N$ to produce two reference sequences, $S_{R1}$ and $S_{R2}$, by randomly introducing the four sets of mutations. The choice of the best reference sequence was determined by the code-length generated by the MDL process. See Tables 3, 4, 5, and 6 for results.
Table 3

Variable number of SNPs: the experiment shows the effect of increasing the number of SNPs on the choice of the reference sequence

| Ref. Seq. | SNPs | No. of inversions | No. of insertions | No. of deletions | Code-length using proposed scheme (Kb) |
|---|---|---|---|---|---|
| 1 | 183 | 52 / 52 | 62 / 59 | 62 | 1815.14 |
| 2 | 224 | 50 / 51 | 66 / 58 | 63 | 1843.35 |

$S_{R2}$ has a higher number of SNPs than $S_{R1}$. The code-length suggests that $S_{R1}$ is the model of choice as it has the smaller code-length. The results show that the MDL scheme works successfully on a variable number of SNPs by choosing the model with the lower number of SNPs.

Table 4

Variable number of insertions: the experiment shows the effect of increasing the number of insertions on the choice of the reference sequence

| Ref. Seq. | SNPs | No. of inversions | No. of insertions | No. of deletions | Code-length using proposed scheme (Kb) |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 136 / 196 | 0 | 1200.3 |
| 2 | 0 | 0 | 132 / 203 | 0 | 1228.25 |

The location and length of these insertions were chosen randomly. The entry 136 / 196 shows that out of 196 insertions in $S_{R1}$ only 136 were removed. The remaining insertions were not recovered due to the choice of τ1 and τ2. $S_{R2}$ has a higher number of insertions than $S_{R1}$. The code-length suggests that $S_{R1}$ is the model of choice as it has the smaller code-length.

Table 5

Variable number of deletions: the experiment shows the effect of increasing the number of deletions on the choice of the reference sequence

| Ref. Seq. | SNPs | No. of inversions | No. of insertions | No. of deletions | Code-length using proposed scheme (Kb) |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 2 / 0 | 182 | 1997.28 |
| 2 | 0 | 0 | 3 / 0 | 189 | 2015.35 |

The location and length of these deletions were chosen randomly. $S_{R2}$ has a higher number of deletions than $S_{R1}$. The code-length suggests that $S_{R1}$ is the model of choice as it has the smaller code-length. The experiment shows that although no insertions were introduced into the actual sequence, two and three insertions were nevertheless found for $S_{R1}$ and $S_{R2}$, respectively. This may be due to a large section of reads that could not align to the reference sequence at the edges of these deletions.

Table 6

Variable number of inversions: the experiment shows that the proposed scheme is robust to the number of inversions in the reference sequence

| Ref. Seq. | SNPs | No. of inversions | No. of insertions | No. of deletions | Code-length using proposed scheme (Kb) |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 586.04 |
| 2 | 0 | 176 / 176 | 0 | 0 | 586.04 |

Both $S_{R1}$ and $S_{R2}$ have the same code-length. This is because the MDL scheme not only detected all the inversions in $S_{R2}$ but also rectified all of them. So effectively $S_{R2} \equiv S_{R1}$ after the MDL process, as explained in Figure 2.

Once the robustness of the MDL scheme on each of the four types of mutations was confirmed, two sets of experiments were carried out on real data using Influenza viruses A, B, and C, which belong to the Orthomyxoviridae group. Influenza virus A has five different strains, i.e., {H1N1, H5N1, H2N2, H3N2, H9N2}, while Influenza viruses B and C each have just one. The genomes of Influenza viruses are divided into a number of segments. Influenza viruses A and B each have eight segments while virus C has seven segments [76–78]. Amongst the first segments of the viruses, one was randomly selected and then modified to be our novel genome, $S_N$. Reads were then derived from $S_N$ and compared with all seven reference sequences. See Table 7 for results.
Table 7

Simulations with Influenza virus A, B, and C

| S.No. | Ref. Seq. (Influenza virus) | No. of inversions | No. of deletions | Code-length using proposed scheme (Kb) |
|---|---|---|---|---|
| 1 | A, H1N1 (NC_002023.1) | 0 / 4 | 1 | 254.109 |
| 2 | A, H5N1 (NC_007357.1) | 0 / 4 | 1 | 254.109 |
| 3 | A, H2N2 (NC_007378.1) | 0 / 4 | 1 | 254.109 |
| 4 | A, H3N2 (NC_007373.1) | 0 / 4 | 1 | 254.109 |
| 5 | A, H9N2 (NC_004910.1) | 0 / 4 | 1 | 254.109 |
| 6 | B (NC_002204.1) | 4 / 4 | 1 | 68.62 |
| 7 | C (NC_006307.1) | 0 / 4 | 1 | 254.027 |

One of the sequences from Influenza virus {A, B, C} was randomly selected and modified to include {SNPs = 7, inversions = 4, deletions = 1, insertions = 3}. As Influenza virus A has five different strains while Influenza viruses B and C each have one, the MDL process was used to compare the seven sequences to determine which is the best reference sequence. Ref. Seq. 6, Influenza virus B, was found to have the smallest code-length (68.62 Kb) and is therefore the model of choice. The experiment also shows that given the optimal reference sequence, in this case Influenza virus B, the MDL process rectifies all inversions (4 / 4). However, given non-optimal reference sequences, the proposed MDL process is not able to rectify the inversions (0 / 4). So the proposed algorithm chooses the optimal reference sequence and, given the optimal reference sequence, corrects most if not all of the inversions.

The second set of experiments analyzed the performance of the MDL proposed scheme on reference sequences of various lengths. The test was designed to check whether the proposed scheme chooses a smaller reference sequence with a larger number of unaligned reads or whether it chooses the optimal reference sequence for assembly. The reads were derived from Influenza A virus (A Puerto Rico 834 (H1N1)) segment 1. All the reference sequences used in this test were also derived from the same H1N1 virus, however with different lengths, see Tables 8 and 9.
Table 8

The experiment uses the proposed MDL scheme on the same set of reads but a different set of reference sequences

| S.No. | Ref. Seq. (%) | No. of unaligned reads | Code-length (KB) | Execution time (s) | Length of new Seq. |
|---|---|---|---|---|---|
| 1 | 1 | 696 | 128.60 | 0.046 | 14 |
| 2 | 2 | 696 | 128.73 | 0.031 | 47 |
| 3 | 5 | 693 | 128.575 | 0.046 | 113 |
| 4 | 10 | 684 | 127.576 | 0.046 | 229 |
| 5 | 25 | 668 | 126.615 | 0.093 | 565 |
| 6 | 50 | 650 | 126.615 | 0.109 | 650 |
| 7 | 100 | 3 | 14.276 | 0.078 | 2342 |
| 8 | 150 | 2 | 21.164 | 0.062 | 2341 |
| 9 | 200 | 2 | 27.808 | 0.124 | 2341 |
| 10 | 300 | 2 | 41.525 | 0.140 | 2341 |

The set of reads contained 3817 reads, all of which were derived from ‘Influenza A virus (A Puerto Rico 834 (H1N1)) segment 1, complete sequence’. Out of 3817 reads the method extracted 696 unique reads, which were then used in the MDL proposed scheme. All the reference sequences were derived from the same Influenza A (H1N1) virus. Ref. Seq. 1%, used in S.No. 1, has a length which is 1% of the actual genome. Similarly, Ref. Seq. 25% has a length which is a quarter of the length of the actual genome. All other reference sequences were derived in a similar way; e.g., Ref. Seq. 200% has two H1N1 sequences concatenated together, making the length twice that of the original H1N1 sequence. The code-length is calculated using Equation (3). The results show that the MDL proposed scheme chooses the best reference sequence, the one which has the smallest code-length as determined by Equation (3). The MDL scheme does not prefer a smaller reference sequence with more unaligned reads over a larger reference sequence with fewer unaligned reads. It chooses Ref. Seq. 7 (S.No. 7, Ref. Seq. 100%), which has the smallest code-length, as the optimal reference sequence. Since all the reads were derived from Ref. Seq. 7, the experiment also confirms the correctness of the reference sequence chosen.

Table 9

The experiment tests the proposed MDL scheme on a single set of reads yet on a number of reference sequences

| S.No. | Ref. Seq. (%) | No. of unaligned reads | Code-length (KB) | Length of new Seq. |
|---|---|---|---|---|
| 1 | 75 | 172 | 25.91 | 1755 |
| 2 | 85 | 148 | 25.10 | 1989 |
| 3 | 95 | 123 | 24.20 | 2223 |
| 4 | 100 | 109 | 23.62 | 2341 |
| 5 | 105 | 108 | 24.22 | 2458 |
| 6 | 115 | 107 | 25.50 | 2692 |
| 7 | 125 | 106 | 26.78 | 2926 |

The set of reads, 390 in total, was derived from ‘Influenza A virus (A Puerto Rico 834 (H1N1)) segment 1, complete sequence’ using the ART read simulator for NGS with read length 30, standard deviation 10, and mean fragment length of 100 [79]. Similarly, the reference sequences were also derived from the same H1N1 virus. Ref. Seq. 75%, used in S.No. 1, has a length which is 75% of the actual genome. Similarly, Ref. Seq. 125% has a quarter of the actual genome concatenated with the complete H1N1 genome, making the total length 125% of H1N1. All other reference sequences were derived in a similar way. The code-length is calculated using Equation (3). The results show that the MDL proposed scheme chooses the correct reference sequence, Ref. Seq. 100% (S.No. 4), even when all the contending sequences are closely related to one another in terms of their genome and length.
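The derivation of reference sequences of a given relative length can be sketched as below. The text does not say which part of the genome is kept or appended, so using a leading prefix is an assumption, as is the function name. Rounding the target length down reproduces the “Length of new Seq.” column of Table 9 for a 2341-base genome.

```python
def scaled_reference(genome, percent):
    """Return a reference whose length is `percent` % of the genome length.

    Whole copies of the genome are concatenated first; the remainder is
    filled with a leading prefix of the genome (an assumption of this sketch).
    """
    target = len(genome) * percent // 100
    copies, remainder = divmod(target, len(genome))
    return genome * copies + genome[:remainder]


# Placeholder sequence with the same length as H1N1 segment 1 in Table 9.
genome = "ACGT" * 585 + "A"  # 2341 bases
lengths = [len(scaled_reference(genome, p)) for p in (75, 85, 95, 100, 105, 115, 125)]
print(lengths)  # [1755, 1989, 2223, 2341, 2458, 2692, 2926]
```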

5 Discussion

The MDL proposed scheme was tested using two sets of experiments. In the first set, the robustness of the proposed scheme was tested using reference sequences, both real and simulated, having four types of mutations {inversions, insertions, deletions, SNPs} compared to the novel genome. This was done with the help of a program called ‘change_sequence’. The program requires the user to input $\Upsilon_m$, the probability of mutation, in addition to the original sequence from which the reference sequences are being derived. It starts by traversing along the length of the genome, and each time it arrives at a new base, a uniformly distributed random generator generates a number between 0 and 100. If the number generated is less than or equal to $\Upsilon_m$, a mutation is introduced. Once the decision to introduce a mutation is made, the choice of which mutation still needs to be made. This is done by rolling a biased four-sided die, where each face of the die represents a particular mutation, i.e., {inversion, deletion, insertion, SNP}. The percentage bias for each face of the die is provided by the user as additional inputs: $\Upsilon_{inv}$ for the percentage bias for inversions, $\Upsilon_{indel}$ for the percentage bias for insertions and deletions, and $\Upsilon_{SNP}$ for SNPs. If the die chooses inversion, insertion or deletion as the mutation, the length of the mutation still needs to be chosen. This requires one last input from the user, $\Upsilon_{len}$, identifying the upper threshold limit of the length of the mutation. A uniformly distributed random generator generates a number between 1 and $\Upsilon_{len}$, and the number generated corresponds to the length of the mutation.
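The description of ‘change_sequence’ can be turned into a short Python sketch. The original program is not available here, so several details are assumptions: the indel bias is split evenly between insertions and deletions, an inversion replaces a segment by its base-wise complement, an insertion adds random bases before the current base, and a SNP substitutes a different base. The parameter names mirror the symbols in the text but are otherwise illustrative.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def change_sequence(sequence, p_mut, bias_inv, bias_indel, bias_snp, max_len, seed=None):
    """Mutate `sequence`; p_mut is the per-base mutation chance in percent."""
    rng = random.Random(seed)
    kinds = ["inversion", "insertion", "deletion", "snp"]
    # Assumption: the indel bias is shared equally by insertions and deletions.
    weights = [bias_inv, bias_indel / 2, bias_indel / 2, bias_snp]
    out = []
    i = 0
    while i < len(sequence):
        if rng.uniform(0, 100) > p_mut:       # no mutation at this base
            out.append(sequence[i])
            i += 1
            continue
        kind = rng.choices(kinds, weights=weights)[0]  # biased four-sided die
        if kind == "snp":
            out.append(rng.choice([b for b in BASES if b != sequence[i]]))
            i += 1
            continue
        length = rng.randint(1, max_len)      # mutation length in [1, max_len]
        if kind == "inversion":
            out.append(sequence[i:i + length].translate(COMPLEMENT))
            i += length
        elif kind == "deletion":
            i += length                       # skip the deleted bases
        else:                                 # insertion of random bases
            out.append("".join(rng.choice(BASES) for _ in range(length)))
            out.append(sequence[i])
            i += 1
    return "".join(out)


original = "ACGT" * 10
mutated = change_sequence(original, p_mut=5, bias_inv=20, bias_indel=40,
                          bias_snp=40, max_len=5, seed=1)
print(mutated)
```

Fixing `seed` makes a run repeatable, which is convenient when the same mutated reference has to be regenerated for comparison.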

The proposed MDL scheme is shown to work successfully, as it chooses as the optimal reference sequence the one which has the smaller number of SNPs (see Table 3), the smaller number of insertions (see Table 4), and the smaller number of deletions (see Table 5) compared to the novel genome. The proposed MDL scheme is also seen to detect and rectify most, if not all, of the inversions present in the reference sequence, see Table 6. Since the code-length of $S_{R1}$ is the same as that of $S_{R2}$, and all the inversions of $S_{R2}$ are rectified, the corrected $S_{R2}$ sequence and the $S_{R1}$ sequence are equally good for reference-assisted assembly.

The experiment carried out using Influenza viruses is shown in Table 7. One sequence was randomly chosen amongst the seven sequences and modified at random locations, using the same ‘change_sequence’ program, to form the novel sequence $S_N$. The novel sequence contained {SNPs = 7, inversions = 4, deletions = 1, insertions = 3} as compared to the original sequence. The MDL process used the reads derived from $S_N$ to compare the seven sequences and determined Influenza virus B to be the optimal reference sequence as it had the smallest code-length. The MDL process rectified all inversions, while only one insertion was found. This meant that the remaining two insertions were smaller than τ1. The set of reads and Influenza virus B were then fed into MiB (MDL-IDITAP-Bayesian estimation comparative assembly pipeline) [80]. The MiB pipeline removes insertions and rectifies inversions using the MDL proposed scheme. IDITAP is a de Bruijn graph based de novo assembler that Identifies the Deletions and Inserts them aT Appropriate Places. BECA (Bayesian Estimator Comparative Assembler) helps in rectifying all the SNPs. The novel genome reconstructed by the MiB pipeline was one contiguous sequence with a length of 2368 bases and a completeness of 96.62%.

The second set of experiments tests the correctness of the MDL proposed scheme by applying it to a single set of reads but a number of different reference sequences having a wide range of lengths. In the first test, 3817 reads were derived from ‘Influenza A virus (H1N1) segment 1’ without any mutations, of which only 696 reads remained after collapsing duplicate reads. The reference sequences were also derived from the same H1N1 virus, with reference sequence (Ref. Seq.) 1% having a length which is 1% of the actual genome. Similarly, Ref. Seq. 25% has a length which is a quarter of the length of the actual genome, and Ref. Seq. 125% has a quarter of the actual genome concatenated with the complete H1N1 genome, making the total length 125% of H1N1. All other reference sequences were derived in a similar way, see Table 8. The unique set of reads and the reference sequences were tested using the MDL proposed scheme, where the code-length was calculated using Equation (3). The results show that the MDL scheme does not choose smaller reference sequences with more unaligned reads; rather, it chooses the correct reference sequence, Ref. Seq. 7, from which all the reads were derived. This experiment therefore further confirms the correctness of the reference sequence chosen.

Lastly, the above experiment was repeated using a single set of reads derived from the same H1N1 virus segment 1, but this time containing mutations. The set of reads, 390 in total, was derived using the ART read simulator for NGS [79] with read length 30, standard deviation 10, and mean fragment length of 100, see Table 9. The results show that the MDL proposed scheme chooses the correct reference sequence, Ref. Seq. 100%, even when all the contending reference sequences are closely related to one another in terms of their genome and length.

All simulations were carried out on an Intel Core i5 CPU M430 @ 2.27 GHz with 4 GB RAM. The execution times of the MDL proposed scheme are provided in Table 8.

6 Conclusions

The article explored the application of Two-Part MDL qualitatively and the application of Sophisticated MDL both qualitatively and quantitatively for selection of the optimal reference sequence for comparative assembly. The article compared the MDL scheme with the standard method of “counting the number of reads that align to the reference sequence” and found that the standard method is not sufficient for finding the optimal sequence. Therefore, the proposed MDL scheme encompassed within itself the standard method of ‘counting the number of reads’ by defining it in an inverted fashion as ‘counting the number of reads that did not align to the reference sequence’ and identified it as the ‘data given the hypothesis’. Furthermore, the proposed scheme included the model, i.e., the reference sequence, and identified the parameters (θ_{M_i}) for the model (M_i) by flagging each base of the reference sequence with {−1, 0, 1}. The parameters of the model helped in identifying inversions and thereafter rectifying them. They also identified the locations of insertions. Insertions larger than a user-defined threshold τ1 and smaller than τ2 were removed. Therefore, the proposed MDL scheme not only chooses the optimal reference sequence but also fine-tunes the chosen sequence for a better assembly of the novel genome.

Experiments conducted to test the robustness and correctness of the MDL proposed scheme, on both real and simulated data, proved to be successful.

Declarations

Acknowledgements

This article has been partly funded by the University of Engineering and Technology, Lahore, Pakistan (No. Estab/DBS/411, Dated Feb 16, 2008), National Science Foundation grant 0915444 and Qatar National Research Fund—National Priorities Research Program grant 09-874-3-235. The first author would like to extend special thanks to his family. The authors acknowledge the Texas A&M Supercomputing Facility (http://sc.tamu.edu/) for providing computing resources useful in conducting the research reported in this article.

Authors’ Affiliations

(1) Department of Electrical and Computer Engineering, Texas A&M University
(2) Department of Electrical Engineering, University of Engineering & Technology
(3) Department of Chemical Engineering, Texas A&M University
(4) Department of Electrical and Computer Engineering

References

  1. Roos T Helsinki: Helsinki University Printing House; 2007.
  2. Domingos P: The role of Occam’s razor in knowledge discovery. Data Min Knowledge Discovery 1999, 3(4):409-425. 10.1023/A:1009868929893
  3. Li M, Vitányi P: An Introduction to Kolmogorov Complexity and its Applications. New York: Springer-Verlag Inc.; 2008.
  4. Rasmussen C, Ghahramani Z: Occam’s razor. Adv. Neural Inf. Process Systs 2001, 13: 294-300.
  5. Vapnik V: The Nature of Statistical Learning Theory. New York: Springer-Verlag Inc.; 2000.
  6. Dougherty J, Tabus I, Astola J: Inference of gene regulatory networks based on a universal minimum description length. EURASIP J. Bioinf. Systs. Biol 2008, 2008: 1-11.
  7. Zhao W, Serpedin E, Dougherty E: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22(17):2129. 10.1093/bioinformatics/btl364
  8. Chaitankar V, Ghosh P, Perkins E, Gong P, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Systs. Biol 2010, 4(Suppl 1):S7. 10.1186/1752-0509-4-S1-S7
  9. Androulakis I, Yang E, Almon R: Analysis of time-series gene expression data: Methods, challenges, and opportunities. Annual Rev. Biomed. Eng 2007, 9: 205-228. 10.1146/annurev.bioeng.9.060906.151904
  10. Lähdesmäki H, Shmulevich I, Yli-Harja O: On learning gene regulatory networks under the Boolean network model. Mach. Learn 2003, 52: 147-167. 10.1023/A:1023905711304
  11. Chaitankar V, Zhang C, Ghosh P, Perkins E, Gong P, Deng Y: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. In IEEE International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, 2009. IJCBS09. Shanghai, China; 2009:487-490.
  12. Dougherty E: Validation of inference procedures for gene regulatory networks. Curr.Genom 2007, 8(6):351. 10.2174/138920207783406505
  13. Zhou X, Wang X, Pal R, Ivanov I, Bittner M, Dougherty E: A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks. Bioinformatics 2004, 20(17):2918-2927. 10.1093/bioinformatics/bth318
  14. Korodi G, Tabus I: An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Trans. Inf Systs. (TOIS) 2005, 23: 3-34. 10.1145/1055709.1055711
  15. Korodi G, Tabus I, Rissanen J, Astola J: DNA sequence compression-Based on the normalized maximum likelihood model. IEEE Signal Process. Mag 2006, 24: 47-53.
  16. Tabus I, Korodi G, Rissanen J: DNA sequence compression using the normalized maximum likelihood model for discrete regression. In IEEE Proceedings on Data Compression Conference, Snowbird, Utah, USA; 2003:253-262.
  17. Evans S, Markham S, Torres A, Kourtidis A, Conklin D: An improved minimum description length learning algorithm for nucleotide sequence analysis. In IEEE Fortieth Asilomar Conference on Signals, Systems and Computers, 2006. ACSSC’06. Pacific Grove, CA; 2006:1843-1850.
  18. Milosavljević A, Jurka J: Discovery by minimal length encoding: a case study in molecular evolution. Mach. Learn 1993, 12: 69-87.
  19. Jornsten R, Yu B: Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics 2003, 19(9):1100. 10.1093/bioinformatics/btg039
  20. Tabus I, Astola J: Clustering the non-uniformly sampled time series of gene expression data. In Proceedings of the Seventh International Symposium on Signal Processing and its Applications, ISSPA 2003, vol. 2. Paris, France; 2003:61-64.
  21. Jain A: Data clustering: 50 years beyond K-means. Pattern Recogn. Lett 2010, 31(8):651-666. 10.1016/j.patrec.2009.09.011
  22. Evans S, Kourtidis A, Markham T, Miller J, Conklin D, Torres A: MicroRNA target detection and analysis for genes related to breast cancer using MDLcompress. EURASIP J. Bioinf. Syst. Biol 2007, 2007: 1-16.
  23. El-Sebakhy E, Faisal K, Helmy T, Azzedin F, Al-Suhaim A: Evaluation of breast cancer tumor classification with unconstrained functional networks classifier. In the 4th ACS/IEEE International Conf. on Computer Systems and Applications. Los Alamitos, CA, USA; 2006:281-287.
  24. Bulyshev A, Semenov S, Souvorov A, Svenson R, Nazarov A, Sizov Y, Tatsis G: Computational modeling of three-dimensional microwave tomography of breast cancer. IEEE Trans. Biomed. Eng 2001, 48(9):1053-1056. 10.1109/10.942596
  25. Bickel D: Minimum description length methods of medium-scale simultaneous inference. Ottawa: Ottawa Institute of Systems Biology, Tech Rep; 2010.
  26. Schug J, Overton G: Modeling transcription factor binding sites with Gibbs sampling and minimum description length encoding. In Proc Int Conf Intell Syst Mol Biol, vol. 5. Halkidiki, Greece; 1997:268-271.
  27. Wajid B, Serpedin E: Review of general algorithmic features for genome assemblers for next generation sequencers. Genomics, Proteomics & Bioinformatics 2012, 10(2):58-73. 10.1016/j.gpb.2012.05.006
  28. Wajid B, Serpedin E: Supplementary information section: review of general algorithmic features for genome assemblers for next generation sequencers. Genomics, Proteomics & Bioinformatics 2012, 10(2):58-73. [https://sites.google.com/site/bilalwajid786/research] 10.1016/j.gpb.2012.05.006
  29. Miller J, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics 2010, 95(6):315-327. 10.1016/j.ygeno.2010.03.001
  30. Pop M: Genome assembly reborn: recent computational challenges. Brief. Bioinf 2009, 10(4):354-366. 10.1093/bib/bbp026
  31. Alkan C, Sajjadian S, Eichler E: Limitations of next-generation genome sequence assembly. Nat. Methods 2010, 8: 61-65.
  32. Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nat. Methods 2009, 6: S6-S12. 10.1038/nmeth.1376
  33. Mardis E: Next-generation DNA sequencing methods. Annu. Rev. Genom. Hum. Genet 2008, 9: 387-402. 10.1146/annurev.genom.9.081307.164359
  34. Schatz M, Delcher A, Salzberg S: Assembly of large genomes using second-generation sequencing. Genome Res 2010, 20(9):1165. 10.1101/gr.101360.109
  35. Pop M, Salzberg S: Bioinformatics challenges of new sequencing technology. Trends Genet 2008, 24(3):142-149. 10.1016/j.tig.2007.12.006
  36. Pop M, Phillippy A, Delcher A, Salzberg S: Comparative genome assembly. Brief. Bioinf 2004, 5(3):237. 10.1093/bib/5.3.237
  37. Kurtz S, Phillippy A, Delcher A, Smoot M, Shumway M, Antonescu C, Salzberg S: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12. 10.1186/gb-2004-5-2-r12
  38. Pop M, Kosack D, Salzberg S: Hierarchical scaffolding with Bambus. Genome Res 2004, 14: 149.
  39. Salzberg S, Sommer D, Puiu D, Lee V: Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comput. Biol 2008, 4(9):e1000186. 10.1371/journal.pcbi.1000186
  40. Schatz M, Langmead B, Salzberg S: Cloud computing and the DNA data race. Nat. Biotechnol 2010, 28(7):691. 10.1038/nbt0710-691
  41. Gnerre S, Lander E, Lindblad-Toh K, Jaffe D: Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biol 2009, 10(8):R88. 10.1186/gb-2009-10-8-r88
  42. Rissanen J: MDL denoising. IEEE Trans. Inf. Theory 2000, 46(7):2537-2543. 10.1109/18.887861
  43. Rissanen J: Hypothesis selection and testing by the MDL principle. Comput. J 1999, 42(4):260-269. 10.1093/comjnl/42.4.260
  44. Baxter R, Oliver J: MDL and MML: Similarities and Differences, vol. 207. Clayton, Victoria, Australia, Tech. Rep: Dept. Comput. Sci. Monash Univ; 1994.
  45. Adriaans P, Vitányi P: The power and perils of MDL. In IEEE International Symposium on Information Theory, ISIT. Nice, France; 2007:2216-2220.
  46. Rissanen J, Tabus I: Kolmogorov’s Structure function in MDL theory and lossy data compression Chap. 10 Adv. Min. Descrip. Length Theory Appl. 5 Cambridge Center, Cambridge, MA 02412: MIT Press; 2005.
  47. Grünwald P, Kontkanen P, Myllymäki P, Silander T, Tirri H: Minimum encoding approaches for predictive modeling. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc; 1998:183-192.
  48. Wajid B, Serpedin E: Minimum description length based selection of reference sequences for comparative assemblers. In 2011 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). San Antonio, TX, USA; 2011:230-233.
  49. Silander T, Roos T, Kontkanen P, Myllymäki P: Factorized normalized maximum likelihood criterion for learning Bayesian network structures. In 4th European Workshop on Probabilistic Graphical Models, Hirtshals, Denmark; 2008:257-264.
  50. Grunwald P: A tutorial introduction to the minimum description length principle. Arxiv preprint math/0406077 (2004)
  51. Oliver J, Hand D: Introduction to Minimum Encoding Inference. Dept. of Comp. Sc., Monash University, Clayton, Vic. 3168, Australia, Tech. Rep; 1994.
  52. Wallace C, Dowe D: Minimum message length and Kolmogorov complexity. Comput. J 1999, 42(4):270-283. 10.1093/comjnl/42.4.270
  53. Grünwald P: Minimum description length tutorial. In Advances in Minimum Description Length: Theory and Applications. 5 Cambridge Center, Cambridge, MA 02412: MIT Press; 2005:1-80.
  54. Barron A, Rissanen J, Yu B: The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 1998, 44(6):2743-2760. 10.1109/18.720554
  55. Xie Q, Barron A: Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Trans. Inf. Theory 2000, 46(2):431-445. 10.1109/18.825803
  56. De Rooij S, Grünwald P: An empirical study of minimum description length model selection with infinite parametric complexity. J. Math. Psychol 2006, 50(2):180-192. 10.1016/j.jmp.2005.11.008
  57. Roos T: Monte Carlo estimation of minimax regret with an application to MDL model selection. In IEEE Information Theory Workshop, 2008. ITW’08. Porto, Portugal; 2008:284-288.
  58. Yang Y: Minimax nonparametric classification. II. Model selection for adaptation. IEEE Trans. Inf. Theory 1999, 45(7):2285-2292. 10.1109/18.796369
  59. Rezaei F, Charalambous C: Robust coding for uncertain sources: a minimax approach. In IEEE Proceedings International Symposium on Information Theory, 2005. ISIT. Adelaide, SA; 2005:1539-1543.
  60. Suen G, Weimer P, Stevenson D, Aylward F, Boyum J, Deneke J, Drinkwater C, Ivanova N, Mikhailova N, Chertkov O, Goodwin L, Currie C, Mead D, Brumm P: The complete genome sequence of Fibrobacter succinogenes S85 reveals a cellulolytic and metabolic specialist. PloS one 2011, 6(4):e18814. 10.1371/journal.pone.0018814
  61. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis K: Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PloS one 2012, 7(2):e30087. 10.1371/journal.pone.0030087
  62. Hattori M, Fujiyama A, Taylor T, Watanabe H, Yada T, Park H, Toyoda A, Ishii K, Totoki Y, Choi D: The DNA sequence of human chromosome 21. Nature 2000, 405(6784):311-319. 10.1038/35012518
  63. Waterston R, Lander E, Sulston J: On the sequencing of the human genome. Proc. Natl. Acad. Sci 2002, 99(6):3712. 10.1073/pnas.042692499
  64. Istrail S, Sutton G, Florea L, Halpern A, Mobarry C, Lippert R, Walenz B, Shatkay H, Dew I, Miller J: Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. US Am 2004, 101(7):1916. 10.1073/pnas.0307971100
  65. Salzberg S, Sommer D, Puiu D, Lee V: Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comput. Biol 2008, 4(9):e1000186. 10.1371/journal.pcbi.1000186
  66. Croucher N: From small reads do mighty genomes grow. Nature Rev. Microbiol 2009, 7(9):621-621. 10.1038/nrmicro2211
  67. Huffman D: A method for the construction of minimum-redundancy codes. Proc. IRE 1952, 40(9):1098-1101.
  68. Cover T, Thomas J, Wiley J: Elements of information theory, vol. 6. New York: Wiley InterScience; 1991.
  69. Rabbani M, Jones P: Digital image compression techniques. Bellingham, Washington, vol. TT7: SPIE Publications; 1991.
  70. Kieffer J: Data Compression. New York: Wiley InterScience; 1971.
  71. Fano R, Hawkins D: Transmission of information: a statistical theory of communications. Am. J. Phys 1961, 29: 793.
  72. Cock P, Fields C, Goto N, Heuer M, Rice P: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010, 38(6):1767-1771. 10.1093/nar/gkp1137
  73. Rodriguez-Ezpeleta N, Hackenberg M, Aransay A: Bioinformatics for High Throughput Sequencing. New York: Springer Verlag; 2011.
  74. Hoare C: Quicksort. Comput. J 1962, 5: 10. 10.1093/comjnl/5.1.10
  75. Kingston J: Algorithms and Data Structures: Design, Correctness, Analysis. Sydney: Addison-Wesley; 1990.
  76. Renegar K: Influenza virus infections and immunity: a review of human and animal models. Lab. Animal Sci 1992, 42(3):222.
  77. Myers K, Olsen C, Gray G: Cases of swine influenza in humans: a review of the literature. Clin. Infect. Diseases 2007, 44(8):1084. 10.1086/512813
  78. Suarez D, Schultz-Cherry S: Immunology of avian influenza virus: a review. Develop. Comparat. Immunol 2000, 24(2–3):269-283.
  79. Huang W, Li L, Myers JR, Marth GT: ART: a next-generation sequencing read simulator. Bioinf 2012, 28(4):593-594. 10.1093/bioinformatics/btr708
  80. Wajid B, Serpedin E, Nounou M, Nounou H: MiB: a comparative assembly processing pipeline. In 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS’12). Washington DC, USA; 2012.

Copyright

© Wajid et al.; licensee Springer. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.