Bioinformatics : problem solving paradigms

What is this book good for? Imagine you are a computer scientist working in the bioinformatics area. P- bably you will be a member of a highly interdisciplinary team consisting of biologists, chemists, mathematicians, computer scientists ranging from p- grammers to algorithm engineers, and eventually people from various further ?elds. A major problem within such interdisciplinary teams is always to ?nd some common language, and, for each member of some discipline, to have profound knowledge of what are the notions, basic concepts and goals of the other participating disciplines, as well as of what they can contribute to the solutionofonesownproblems. Thisdoes,ofcourse,notmeanthatacomputer scientist should do the job of the biologist. Nevertheless, a computer scientist should be able to understand what a biologist deals with. On the other hand, the biologist should not do the computer scientists job, but should know what computer science and algorithm engineering might contribute to thesolution of her/his problems, and also how problems should be stated in order for the computer scientist to understand them. This book primarily aims to show the potential that algorithm engin- ring o?ers for the solution of core bioinformatics problems.

1 Core Bioinformatics Problems	p. 1
1.1 DNA Mapping and Sequencing	p. 1
1.1.1 Size of Human DNA	p. 1
1.1.2 Copying DNA: Polymerase Chain Reaction (PCR)	p. 2
1.1.3 Hybridization and Microarrays	p. 2
1.1.4 Cutting DNA into Fragments	p. 2
1.1.5 Prefix Cutting	p. 3
1.1.6 Unique Markers	p. 3
1.1.7 Measuring the Length of DNA Molecules	p. 4
1.1.8 Sequencing Short DNA Molecules	p. 4
1.1.9 Mapping Long DNA Molecules	p. 4
1.1.10 Mapping by Single Digestion	p. 5
1.1.11 Mapping by Double Digestion	p. 6
1.1.12 Mid-sized DNA Molecules: Shotgun Sequencing	p. 7
1.2 String Storage and Pattern Matching	p. 8
1.2.1 Complexity of Pattern Matching in Bioinformatics	p. 8
1.2.2 First Look at Suffix Trees	p. 9
1.2.3 A Couple of Suffix Tree Applications	p. 10
1.3 String Alignment	p. 11
1.4 Multiple Alignment	p. 13
1.5 Gene Detection	p. 15
1.6 Gene Comparison	p. 16
1.7 RNA Structure Prediction	p. 18
1.8 Protein Structure Prediction	p. 19
1.8.1 Holy Grail Protein Structure Prediction	p. 19
1.8.2 Secondary Structure	p. 20
1.8.3 Tertiary Structure Prediction with Contact Maps	p. 20
1.8.4 HP-Model Approach	p. 21
1.8.5 Protein Threading	p. 22
1.9 Molecular Networks	p. 23
1.10 Bibliographic Remarks	p. 24
2 Turning to Algorithmic Problems	p. 25
2.1 Presentation of Algorithmic Problems	p. 25
2.2 DNA Mapping	p. 26
2.2.1 Mapping by Hybridization	p. 26
2.2.2 Mapping by Single Digestion	p. 27
2.2.3 Mapping by Double Digestion	p. 30
2.3 Shotgun Sequencing and Shortest Common Superstrings	p. 31
2.4 Exact Pattern Matching	p. 35
2.4.1 Naive Pattern Matching	p. 35
2.4.2 Knuth-Morris-Pratt Algorithm	p. 35
2.4.3 Multi-Pattern Search	p. 37
2.4.4 Multi-Pattern Search	p. 37
2.4.5 Formal Definition of Suffix Trees	p. 38
2.5 Soft Pattern Matching = Alignment	p. 41
2.5.1 Alignments Restated	p. 41
2.5.2 Evaluating Alignments by Scoring Functions	p. 41
2.6 Multiple Alignment	p. 43
2.6.1 Sum-of-Pairs Maximization	p. 43
2.6.2 Multiple Alignment Along a Guide Tree	p. 43
2.6.3 Consensus Line Optimization	p. 45
2.6.4 Steiner String	p. 45
2.6.5 Equivalence of Consensus Lines and Steiner Strings	p. 45
2.6.6 Generalization to Steiner Trees/Phylogenetic Alignment	p. 47
2.6.7 Profile Alignment	p. 47
2.6.8 Hidden Markov Multiple Alignment	p. 48
2.6.9 Example Heuristic for Obtaining Multiple Alignments	p. 51
2.7 Gene Detection	p. 52
2.8 Genome Rearrangement	p. 53
2.9 RNA Structure Prediction	p. 55
2.10 Protein Structure Prediction	p. 56
2.10.1 Comparative Approaches Using Neural Networks	p. 56
2.10.2 Protein Threading	p. 57
2.11 Bi-Clustering	p. 59
2.12 Bibliographic Remarks	p. 60
3 Dynamic Programming	p. 61
3.1 General Principle of Dynamic Programming	p. 61
3.2 Alignment	p. 63
3.2.1 Problem Restated	p. 63
3.2.2 Parameterization	p. 64
3.2.3 Bellman Principle	p. 64
3.2.4 Recursive Solution	p. 64
3.2.5 Number of Different Subcalls and Overall Complexity	p. 65
3.2.6 Tabular Organization of the Bottom-Up Computation	p. 65
3.3 Multiple Alignment	p. 67
3.4 Affine Gap Alignment	p. 67
3.4.1 Problem Restated	p. 67
3.4.2 Parameterization and Conditioning	p. 68
3.4.3 Bellman Principle	p. 68
3.4.4 Recursive Solution	p. 68
3.4.5 Number of Different Subcalls and Overall Complexity	p. 69
3.5 Exon Assembly	p. 69
3.5.1 Problem Restated	p. 69
3.5.2 Parameterization and Conditioning	p. 70
3.5.3 Bellman Principle	p. 70
3.5.4 Recursive Solution	p. 71
3.5.5 Number of Different Subcalls and Overall Complexity	p. 72
3.5.6 Preprocessing of Simple Terms	p. 72
3.5.7 Parallel Computation of Recursive Terms	p. 72
3.6 RNA Structure Prediction	p. 73
3.6.1 Problem Restated	p. 73
3.6.2 Parameterization and Conditioning	p. 74
3.6.3 Recursive Solution and Bellman Principle	p. 74
3.6.4 Number of Different Subcalls and Overall Complexity	p. 75
3.6.5 Variation: Free Energy Minimization	p. 76
3.7 Viterbi Algorithm	p. 83
3.7.1 Problem Restated	p. 83
3.7.2 Parameterization and Conditioning	p. 83
3.7.3 Bellman Principle	p. 83
3.7.4 Recursive Computation	p. 83
3.7.5 Counting and Overall Complexity	p. 84
3.8 Baum-Welch Algorithm	p. 84
3.8.1 Problem Motivation	p. 84
3.8.2 A Couple of Important 'Variables'	p. 84
3.8.3 Computing Forward Variables	p. 85
3.8.4 Computing Backward Variables	p. 85
3.8.5 Computing Transition Variables	p. 85
3.8.6 Computing State Variables	p. 85
3.8.7 Model Improvement	p. 86
3.9 Expressiveness of Viterbi Equations	p. 86
3.10 Lifted Phylogenetic Alignment	p. 91
3.10.1 Problem Restated	p. 91
3.10.2 Parameterization and Conditioning	p. 92
3.10.3 Bellman Principle	p. 92
3.10.4 Recursive Solution	p. 92
3.10.5 Counting and Overall Complexity	p. 93
3.10.6 Preprocessing of Simple Terms	p. 93
3.11 Protein Threading	p. 93
3.11.1 Problem Restated	p. 93
3.11.2 Parameterization and Conditioning	p. 94
3.11.3 Bellman Principle	p. 94
3.11.4 Recursive Solution	p. 94
3.11.5 Counting and Overall Complexity	p. 95
3.12 Bibliographic Remarks	p. 95
4 Intelligent Data Structures	p. 97
4.1 Representation Matters	p. 97
4.1.1 Intelligent Data Structures Used in Computer Science	p. 97
4.1.2 Covering a Truncated Board	p. 97
4.1.3 Choose Three Bits and Win	p. 98
4.1.4 Check a Proof	p. 99
4.2 PQ-Trees	p. 101
4.2.1 Basic Notions	p. 101
4.2.2 Transformation rules for marked PQ-Trees	p. 103
4.3 Suffix Trees	p. 111
4.3.1 Outline of Ukkonen's Algorithm and Basic Procedures	p. 111
4.3.2 Insertion Procedure	p. 112
4.3.3 Saving Single Character Insertions	p. 115
4.3.4 Saving Navigation Steps	p. 116
4.3.5 Estimation of the Overall Number of Visited Nodes	p. 117
4.3.6 Ukkonen's Theorem	p. 120
4.3.7 Common Suffix Tree for Several Strings	p. 123
4.3.8 Applications of Suffix Trees	p. 124
4.4 Least Common Ancestor	p. 131
4.4.1 Motivation	p. 131
4.4.2 Least Common Ancestor in Full Binary Trees	p. 131
4.4.3 Least Common Ancestor in Arbitrary Trees	p. 135
4.5 Signed Genome Rearrangement	p. 144
4.5.1 Reality-Desire Diagrams of Signed Permutations	p. 144
4.5.2 "'Ping-Pong"' of Lower and Upper Bounds	p. 148
4.5.3 Padding RD-Diagrams	p. 159
4.5.4 Sorting Padded Good Components	p. 161
4.5.5 Summary of the Sorting Procedure	p. 164
4.6 Bibliographic Remarks	p. 166
5 NP-Hardness of Core Bioinformatics Problems	p. 167
5.1 Getting Familiar with Basic Notions of Complexity Theory	p. 167
5.2 Cook's Theorem: Primer NP-Complete Problem	p. 171
5.3 "Zoo" of NP-Complete Problems	p. 175
5.4 Bridge to Bioinformatics: Shortest Common Super-Sequence	p. 188
5.5 NP-Completeness of Core Bioinformatics Problems	p. 198
5.5.1 Multiple Alignment	p. 198
5.5.2 Shortest Common Superstring	p. 201
5.5.3 Double Digest	p. 204
5.5.4 Protein Threading	p. 204
5.5.5 Bi-Clustering	p. 206
5.5.6 Pseudoknot Prediction	p. 209
5.5.7 The Picture of Presented Reductions	p. 213
5.5.8 Further NP-Complete Bioinformatics Problems	p. 213
5.6 Bibliographic Remarks	p. 215
6 Approximation Algorithms	p. 217
6.1 Basics of Approximation Algorithms	p. 217
6.1.1 What is an Approximation Algorithm?	p. 217
6.1.2 First Example: Metric Travelling Salesman	p. 217
6.1.3 Lower and Upper Bounding	p. 218
6.1.4 What are Approximation Algorithms good for?	p. 219
6.2 Shortest Common Superstring Problem	p. 221
6.2.1 Feasible Lower Bound	p. 221
6.2.2 Cycle Covers Related to Perfect Matchings Problem	p. 222
6.2.3 Computing Maximum Cost Perfect Matchings	p. 223
6.2.4 Cycle Cover Versus Cyclic Covering by Strings	p. 225
6.2.5 Some Combinatorial Statements on Cyclic Covering	p. 226
6.2.6 4-Approximation Algorithm	p. 228
6.2.7 Putting Things Together	p. 229
6.3 Steiner String	p. 230
6.3.1 Problem Restated	p. 230
6.3.2 Feasible Lower Bound	p. 230
6.3.3 2-Approximation Algorithm	p. 231
6.4 Phylogenetic Alignment	p. 232
6.5 Multiple Alignment	p. 234
6.5.1 Feasible Lower Bound	p. 234
6.5.2 2-Approximation Algorithm	p. 235
6.6 Sorting Unsigned Genomes	p. 236
6.6.1 Feasible Lower Bound	p. 236
6.6.2 4-Approximation Algorithm	p. 237
6.6.3 2-Approximation Algorithm	p. 239
6.7 HP-Model Protein Structure Prediction	p. 243
6.7.1 Feasible Upper Bound	p. 243
6.7.2 Asymptotic 4-Approximation Algorithm	p. 245
6.8 Bibliographic Remarks	p. 248
7 A Selection of Metaheuristics and Various Projects	p. 249
7.1 Multi-Layer Perceptron	p. 249
7.1.1 Architecture of Multi-Layer Perceptrons	p. 250
7.1.2 Forward Propagation of Input Signals	p. 251
7.1.3 Input and Output Encoding	p. 251
7.1.4 Backpropagation of Error Signals	p. 252
7.1.5 Tuning Backpropagation	p. 253
7.1.6 Generalization and Overfitting	p. 253
7.1.7 Application to Secondary Structure Prediction	p. 254
7.1.8 Further Applications	p. 254
7.2 Support Vector Machines	p. 255
7.2.1 Architecture of Support Vector Machines	p. 255
7.2.2 Margin Maximization	p. 256
7.2.3 Primal Problem	p. 257
7.2.4 Dual Problem	p. 258
7.2.5 Kernel Function Trick	p. 259
7.2.6 Generalization and Over-Fitting	p. 260
7.2.7 Application to Contact Map Prediction	p. 261
7.2.8 Soft Margin Maximization	p. 263
7.2.9 Semi-Supervised Support Vector Machines	p. 264
7.3 Hidden Markov Models	p. 265
7.3.1 Architecture of Hidden Markov Models	p. 266
7.3.2 Causes and Effects	p. 266
7.3.3 Bioinformatics Application: CG Islands	p. 268
7.3.4 Further Applications	p. 269
7.4 Genetic Algorithms	p. 269
7.4.1 Basic Characteristics of Genetic Algorithms	p. 269
7.4.2 Application to HP-Fold Optimization	p. 271
7.4.3 Further Applications	p. 272
7.5 Ant Colony Optimization	p. 272
7.5.1 Basic Characteristics of Ant Colony Algorithms	p. 273
7.5.2 Application to HP-Fold Optimization	p. 275
7.5.3 Further Applications	p. 276
7.6 Bibliographic Remarks	p. 276
References	p. 279
Index	p. 285

Available:*

On Order

Summary

Summary

Table of Contents