Protein bioinformatics : an algorithmic approach to sequence and structure analysis

Genomics and bioinformatics play an increasingly important and transformative role in medicine, society and agriculture. The mapping of the human genome has revealed some 35,000 genes, each of which might code for more than one protein, resulting in 100,000 proteins in humans alone. Since proteins are attractive targets for drug development, efforts are now underway to map sequences and assign functions to many novel proteins. This book takes the novel approach of covering both the sequence and structure analysis of proteins in one volume and from an algorithmic perspective.

Key features of the book include:

Provides a comprehensive introduction to protein sequence and structure analysis.
Takes an algorithmic approach, relying on computational methods rather than theoretical ones.
Provides an integrated presentation of theory, examples, exercises and applications.
Covers both protein sequence analysis and protein structure analysis.
Accessible enough for biologists, yet rigorous enough for computer scientists and mathematicians.
Supported by a Web site featuring exercises, solutions, images, and computer programs.

Visit this website for exercises with solutions, computer programs, errata and additional material:

http://www.ii.uib.no/proteinbioinformatics/

Author Notes

Ingvar Eidhammer and Inge Jonassen: Department of Informatics, University of Bergen, Norway
William R. Taylor: Division of Mathematical Biology, National Institute for Medical Research, London, UK

Preface	p. xiii
Acknowledgements	p. xix
Part I Sequence Analysis
1 Pairwise Global Alignment of Sequences	p. 3
1.1 Alignment and Evolution	p. 4
1.2 What is an Alignment?	p. 6
1.3 A Scoring Scheme for the Model	p. 6
1.4 Finding Highest-Scoring Alignments with Dynamic Programming	p. 7
1.4.1 Determine H_{i,j}	p. 8
1.4.2 Use of matrices	p. 9
1.4.3 Finding the alignments that give the highest score	p. 10
1.4.4 Gaps	p. 13
1.5 Scoring Matrices	p. 13
1.6 Scoring Gaps: Gap Penalties	p. 14
1.7 Dynamic Programming for General Gap Penalty	p. 17
1.8 Dynamic Programming for Affine Gap Penalty	p. 18
1.9 Alignment Score and Sequence Distance	p. 20
1.10 Exercises	p. 22
1.11 Bibliographic notes	p. 23
2 Pairwise Local Alignment and Database Search	p. 25
2.1 The Basic Operation: Comparing Two Sequences	p. 26
2.2 Dot Matrices	p. 27
2.2.1 Filtering	p. 28
2.2.2 Repeating segments	p. 30
2.3 Dynamic Programming	p. 31
2.3.1 Initialization	p. 33
2.3.2 Finding the best local alignments	p. 33
2.3.3 Algorithms	p. 33
2.3.4 Scoring matrices and gap penalties	p. 34
2.4 Database Search: BLAST	p. 34
2.4.1 The procedure	p. 36
2.4.2 Preprocess the query: make the word list	p. 37
2.4.3 Scanning the database sequences	p. 38
2.4.4 Extending to HSP	p. 40
2.4.5 Introducing gaps	p. 40
2.4.6 Algorithm	p. 42
2.5 Exercises	p. 43
2.6 Bibliographic notes	p. 45
3 Statistical Analysis	p. 47
3.1 Hypothesis Testing for Sequence Homology	p. 47
3.1.1 Random generation of sequences	p. 48
3.1.2 Use of Z values for estimating the statistical significance	p. 50
3.2 Statistical Distributions	p. 51
3.2.1 Poisson probability distribution	p. 51
3.2.2 Extreme value distributions	p. 52
3.3 Theoretical Analysis of Statistical Significance	p. 53
3.3.1 The P value has an extreme value distribution	p. 55
3.3.2 Theoretical analysis for database search	p. 56
3.4 Probability Distributions for Gapped Alignments	p. 57
3.5 Assessing and Comparing Programs for Database Search	p. 58
3.5.1 Sensitivity and specificity	p. 59
3.5.2 Discrimination power	p. 60
3.5.3 Using more sequences as queries	p. 62
3.6 Exercises	p. 62
3.7 Bibliographic notes	p. 64
4 Multiple Global Alignment and Phylogenetic Trees	p. 65
4.1 Dynamic Programming	p. 65
4.1.1 SP score of multiple alignments	p. 67
4.1.2 A pruning algorithm for the DP solution	p. 69
4.2 Multiple Alignments and Phylogenetic Trees	p. 72
4.3 Phylogeny	p. 74
4.3.1 The number of different tree topologies	p. 76
4.3.2 Molecular clock theory	p. 77
4.3.3 Additive and ultrametric trees	p. 77
4.3.4 Different approaches for reconstructing phylogenetic trees	p. 79
4.3.5 Distance-based construction	p. 81
4.3.6 Rooting of trees	p. 85
4.3.7 Statistical test: bootstrapping	p. 86
4.4 Progressive Alignment	p. 87
4.4.1 Aligning two subset alignments	p. 88
4.4.2 Clustering	p. 90
4.4.3 Sequence weights	p. 93
4.4.4 CLUSTAL	p. 95
4.5 Other Approaches	p. 97
4.6 Exercises	p. 97
4.7 Bibliographic notes	p. 100
5 Scoring Matrices	p. 101
5.1 Scoring Matrices Based on Physio-Chemical Properties	p. 102
5.2 PAM Scoring Matrices	p. 103
5.2.1 The evolutionary model	p. 104
5.2.2 Calculate substitution matrix	p. 104
5.2.3 Matrices for general evolutionary time	p. 107
5.2.4 Measuring sequence similarity by use of M^τ	p. 109
5.2.5 Odds matrices	p. 109
5.2.6 Scoring matrices (log-odds matrices)	p. 111
5.2.7 Estimating the evolutionary distance	p. 111
5.3 BLOSUM Scoring Matrices	p. 113
5.3.1 Log-odds matrix	p. 114
5.3.2 Developing scoring matrices for different evolutionary distances	p. 115
5.4 Comparing BLOSUM and PAM Matrices	p. 117
5.5 Optimal Scoring Matrices	p. 119
5.5.1 Analysis for one sequence	p. 119
5.6 Exercises	p. 120
5.7 Bibliographic notes	p. 122
6 Profiles	p. 123
6.1 Constructing a Profile	p. 124
6.1.1 Notation	p. 126
6.1.2 Removing rows and columns	p. 127
6.1.3 Position weights	p. 127
6.1.4 Sequence weights	p. 129
6.1.5 Treating gaps	p. 129
6.2 Searching Databases with Profiles	p. 130
6.3 Iterated BLAST: PSI-BLAST	p. 131
6.3.1 Making the multiple alignment	p. 132
6.3.2 Constructing the profile	p. 132
6.4 HMM Profile	p. 134
6.4.1 Definitions for an HMM	p. 134
6.4.2 Constructing a profile HMM for a protein family	p. 135
6.4.3 Comparing a sequence with an HMM	p. 137
6.4.4 Protein family databases	p. 137
6.5 Exercises	p. 137
6.6 Bibliographic notes	p. 139
7 Sequence Patterns	p. 141
7.1 The PROSITE Language	p. 142
7.2 Exact/Approximate Matching	p. 143
7.3 Defining Pattern Classes by Imposing Constraints	p. 144
7.4 Pattern Scoring: Information Theory	p. 145
7.4.1 Information theory	p. 145
7.4.2 Scoring patterns	p. 147
7.5 Generalization and Specialization	p. 148
7.6 Pattern Discovery: Introduction	p. 148
7.7 Comparison-Based Methods	p. 151
7.7.1 Pivot-based methods	p. 152
7.7.2 Tree progressive methods	p. 153
7.8 Pattern-Driven Methods: Pratt	p. 154
7.8.1 The main procedure	p. 155
7.8.2 Preprocessing	p. 156
7.8.3 The pattern space	p. 156
7.8.4 Searching	p. 156
7.8.5 Ambiguous components	p. 159
7.8.6 Specialization	p. 160
7.8.7 Pattern scoring	p. 160
7.9 Exercises	p. 160
7.10 Bibliographic notes	p. 162
Part II Structure Analysis
8 Structures and Structure Descriptions	p. 165
8.1 Units of Structure Descriptions	p. 167
8.2 Coordinates	p. 168
8.3 Distance Matrices	p. 168
8.4 Torsion Angles	p. 173
8.5 Coarse Level Description	p. 174
8.5.1 Line segments (sticks)	p. 174
8.5.2 Ellipsoid	p. 175
8.5.3 Helices	p. 176
8.5.4 Strands and sheets	p. 177
8.5.5 Topology of Protein Structure (TOPS)	p. 177
8.6 Identifying the SSEs	p. 178
8.6.1 Use of distance matrices	p. 178
8.6.2 Define Secondary Structure of Proteins (DSSP)	p. 179
8.7 Structure Comparison	p. 182
8.7.1 Structure descriptions for comparison	p. 183
8.7.2 Structure representation	p. 185
8.8 Framework for Pairwise Structure Comparison	p. 187
8.9 Exercises	p. 189
8.10 Bibliographic notes	p. 191
9 Superposition and Dynamic Programming	p. 193
9.1 Superposition	p. 193
9.1.1 Coordinate RMSD	p. 193
9.1.2 Distance RMSD	p. 195
9.1.3 Using RMSD as scoring of structure similarities	p. 196
9.2 Alternating Superposition and Alignment	p. 196
9.3 Double Dynamic Programming	p. 199
9.3.1 Low-level scoring matrices	p. 201
9.3.2 High-level scoring matrix	p. 203
9.3.3 Iterated double dynamic programming	p. 203
9.4 Similarity of the Methods	p. 206
9.5 Exercises	p. 206
9.6 Bibliographic notes	p. 210
10 Geometric Techniques	p. 211
10.1 Geometric Hashing	p. 211
10.1.1 Two-dimensional geometric hashing	p. 211
10.1.2 Geometric hashing for structure comparison	p. 216
10.1.3 Geometric hashing for SSE representation	p. 219
10.1.4 Clustering	p. 220
10.2 Distance Matrices	p. 221
10.2.1 Measuring the similarity of distance (sub)matrices	p. 224
10.3 Exercises	p. 227
10.4 Bibliographic notes	p. 228
11 Clustering: Combining Local Similarities	p. 229
11.1 Compatibility and Consistency	p. 229
11.2 Searching for Seed Matches	p. 232
11.3 Consistency	p. 232
11.3.1 Test for consistency	p. 232
11.3.2 Overlapping clusters	p. 234
11.4 Clustering Algorithms	p. 235
11.4.1 Linear clustering	p. 235
11.4.2 Hierarchical clustering	p. 236
11.5 Clustering by Use of Transformations	p. 237
11.5.1 Comparing transformations	p. 237
11.5.2 Calculating the new transformation	p. 241
11.5.3 Algorithm	p. 241
11.6 Clustering by Use of Relations	p. 243
11.6.1 How many relations to compare?	p. 244
11.6.2 Geometric relation	p. 244
11.6.3 Distance relation	p. 245
11.6.4 Use of graph theory	p. 246
11.7 Refinement	p. 248
11.8 Exercises	p. 248
11.9 Bibliographic notes	p. 250
12 Significance and Assessment of Structure Comparisons	p. 253
12.1 Constructing Random Structure Models	p. 253
12.1.1 Use of distance geometry	p. 254
12.2 Use of Structure Databases	p. 255
12.2.1 Constructing nonredundant subsets	p. 255
12.2.2 Demarcation line for similarity	p. 255
12.3 Reversing the Protein Chain	p. 255
12.4 Randomized Alignment Models	p. 257
12.5 Assessing Comparison and Scoring Methods	p. 257
12.6 Is RMSD Suitable for Scoring?	p. 258
12.7 Scoring and Biological Significance	p. 259
12.8 Exercises	p. 259
12.9 Bibliographic notes	p. 260
13 Multiple Structure Comparison	p. 261
13.1 Multiple Superposition	p. 261
13.2 Progressive Structure Alignment	p. 263
13.2.1 Scoring	p. 265
13.2.2 Construction of consensus	p. 266
13.3 Finding a Common Core from a Multiple Alignment	p. 266
13.4 Discovering Common Cores	p. 267
13.4.1 Finding the multiple seed matches	p. 268
13.4.2 Pairwise clustering	p. 269
13.4.3 Determining common cores	p. 269
13.4.4 Scoring clusters	p. 271
13.5 Local Structure Patterns	p. 271
13.5.1 Local packing patterns	p. 272
13.5.2 Discovering packing patterns	p. 273
13.5.3 The approach	p. 273
13.5.4 Scoring the packing motifs	p. 276
13.6 Exercises	p. 276
13.7 Bibliographic notes	p. 278
14 Protein Structure Classification	p. 279
14.1 Protein Domains	p. 279
14.2 An Ising Model for Domain Identification	p. 281
14.3 Domain Classes	p. 283
14.3.1 Mainly-α domains	p. 283
14.3.2 Mainly-β domains	p. 284
14.3.3 α-β domains	p. 285
14.4 Folds	p. 285
14.5 Automatic Approaches to Classification	p. 285
14.6 Databases for Structure Classification	p. 286
14.7 FSSP-Dali Domain Dictionary	p. 286
14.8 CATH	p. 288
14.8.1 Domains	p. 288
14.8.2 Class	p. 288
14.8.3 Architecture	p. 289
14.8.4 Topology (fold family)	p. 289
14.8.5 Homologous superfamily	p. 290
14.8.6 Sequence families	p. 290
14.8.7 The CATH classification procedure	p. 290
14.9 Classification Based on Sticks	p. 291
14.10 Exercises	p. 291
14.11 Bibliographic notes	p. 292
Part III Sequence-Structure Analysis
15 Structure Prediction: Threading	p. 297
15.1 Protein Secondary Structure Prediction	p. 298
15.1.1 Artificial neural networks	p. 298
15.1.2 The PHD program	p. 300
15.1.3 Accuracy in secondary structure prediction	p. 301
15.2 Threading	p. 301
15.3 Methods Based on Sequence Alignment	p. 301
15.3.1 The 3D-1D matching method	p. 302
15.3.2 The FUGUE method	p. 303
15.4 Methods Using 3D Interactions	p. 303
15.4.1 Potentials of mean force	p. 305
15.4.2 Towards modelling methods	p. 307
15.5 Alignment Methods	p. 307
15.5.1 Frozen approximation	p. 307
15.5.2 Double Dynamic Programming	p. 309
15.6 Multiple Sequence/Structure Threading	p. 310
15.6.1 Simple multiple sequence threading	p. 311
15.7 Combined Sequence/Threading Methods	p. 311
15.8 Assessment of Threading Methods	p. 311
15.8.1 Fold recognition	p. 312
15.8.2 Alignment accuracy	p. 312
15.8.3 CASP and CAFASP	p. 313
15.9 Bibliographic notes	p. 313
Appendix A Basics in Mathematics, Probability and Algorithms	p. 315
A.1 Mathematical Formulae and Notation	p. 315
A.2 Boolean Algebra	p. 316
A.3 Set Theory	p. 316
A.4 Probability	p. 317
A.4.1 Permutation and combination	p. 317
A.4.2 Probability distributions	p. 318
A.4.3 Expected value	p. 318
A.5 Tables, Vectors and Matrices	p. 319
A.6 Algorithmic Language	p. 319
A.6.1 Alternatives	p. 319
A.6.2 Loops	p. 320
A.7 Complexity	p. 320
Appendix B Introduction to Molecular Biology	p. 323
B.1 The Cell and the Molecules of Life: DNA-RNA Proteins	p. 323
B.2 Chromosomes and Genes	p. 326
B.3 The Central Dogma of Molecular Biology	p. 327
B.4 The Genetic Code	p. 327
B.5 Protein Function	p. 329
B.5.1 The gene ontology	p. 330
B.6 Protein Structure	p. 330
B.7 Evolution	p. 334
B.8 Insulin Example	p. 335
B.9 Bibliographic notes	p. 336
References	p. 337
Index	p. 349
