DNA, words and models : statistics of exceptional words

An important problem in computational biology is identifying short DNA sequences (mathematically, 'words') associated with a biological function. One approach is to determine whether a particular word is simply random or is statistically significant, for example because of its frequency or location. This book introduces the mathematical and statistical ideas used to solve this so-called exceptional word problem. It begins with a detailed description of the principal models used in sequence analysis; Markovian models are central here and capture compositional information about the sequence being analysed. There follows an introduction to several statistical methods used to find words that are exceptional with respect to the chosen model. The second half of the book is illustrated with numerous examples drawn from the analysis of bacterial genomes, making this a practical guide for users who face a real situation and need to choose an appropriate procedure.

List of figures
List of tables
Preface
Preliminary notions and notations
1 Introduction
1.1 The context
1.2 Randomness and models
1.3 A bit of biology
2 Simple models for biological sequences
2.1 Why a model?
2.2 Permutation model
2.3 Bernoulli model
3 Introduction to Markov chain models
3.1 Assumptions
3.2 Markov chain of order 1
3.3 Markov chain of order m
3.4 Estimation of the parameters
4 Taking heterogeneities into account
4.1 Phased chains
4.2 Piecewise homogeneous Markov chains
4.3 Translation conditional models
5 Statistical properties of word occurrences
5.1 Count
5.2 Positions and distances
5.3 Distribution along the sequence
6 Words with unexpected frequencies
6.1 Exact distribution and approximations
6.2 Influence of the model
6.3 Over-representation of Chi sites in E. coli and H. influenzae
6.4 Under-representation of palindromes of length 6 in E. coli and in the phage Lambda
7 Words with unexpected locations
7.1 Chi sites in the genome of H. influenzae
7.2 Distribution of palindromes in E. coli's genome
7.3 Detection of promoter sites in B. subtilis
The last word
References
Index