Cover image for Protein bioinformatics : an algorithmic approach to sequence and structure analysis
Title:
Protein bioinformatics : an algorithmic approach to sequence and structure analysis
Personal Author:
Publication Information:
Chichester, West Sussex : John Wiley & Sons, 2004
ISBN:
9780470848395

Available:*

Library
Item Barcode
Call Number
Material Type
Item Category 1
Status
Searching...
30000010063553 QD431.25.S85 E42 2004 Open Access Book Book
Searching...

On Order

Summary

Summary

Genomics and bioinformatics play an increasingly important and transformative role in medicine, society and agriculture. The mapping of the human genome has revealed 35,000 or so genes which might code for more than one protein, resulting in 100,000 proteins for the humans alone. Since proteins are attractive targets for developing drugs, efforts are now underway to map sequences and assign functions to many novel proteins. This book takes the novel approach to cover both the sequence and structure analysis of proteins in one volume and from an algorithmic perspective.

Key features of the book include:

Provides a comprehensive introduction to the analysis of protein sequence and structure analysis. Takes an algorithmic approach, relying on computational methods rather than theoretical. Provides an integrated presentation of theory, examples, exercises and applications. Includes coverage of both protein structure, and sequence, analysis. Accessible enough for biologists, yet rigorous enough for computer scientists and mathematicians. Supported by a Web site featuring exercises, solutions, images, and computer programs.

Visit this website for exercises with solutions, computer programs, errata and additional material:

http://www.ii.uib.no/proteinbioinformatics/


Author Notes

Ingvar Eidhammer and Inge Jonassen: Department of Informatics, University of Bergen, Norway
William R. Taylor: Division of Mathematical Biology, National Institute for Medical Research, London, UK


Table of Contents

Prefacep. xiii
Acknowledgementsp. xix
Part I Sequence Analysis
1 Pairwise Global Alignment of Sequencesp. 3
1.1 Alignment and Evolutionp. 4
1.2 What is an Alignment?p. 6
1.3 A Scoring Scheme for the Modelp. 6
1.4 Finding Highest-Scoring Alignments with Dynamic Programmingp. 7
1.4.1 Determine H[subscript i,j]p. 8
1.4.2 Use of matricesp. 9
1.4.3 Finding the alignments that give the highest scorep. 10
1.4.4 Gapsp. 13
1.5 Scoring Matricesp. 13
1.6 Scoring Gaps: Gap Penaltiesp. 14
1.7 Dynamic Programming for General Gap Penaltyp. 17
1.8 Dynamic Programming for Affine Gap Penaltyp. 18
1.9 Alignment Score and Sequence Distancep. 20
1.10 Exercisesp. 22
1.11 Bibliographic notesp. 23
2 Pairwise Local Alignment and Database Searchp. 25
2.1 The Basic Operation: Comparing Two Sequencesp. 26
2.2 Dot Matricesp. 27
2.2.1 Filteringp. 28
2.2.2 Repeating segmentsp. 30
2.3 Dynamic Programmingp. 31
2.3.1 Initializationp. 33
2.3.2 Finding the best local alignmentsp. 33
2.3.3 Algorithmsp. 33
2.3.4 Scoring matrices and gap penaltiesp. 34
2.4 Database Search: BLASTp. 34
2.4.1 The procedurep. 36
2.4.2 Preprocess the query: make the word listp. 37
2.4.3 Scanning the database sequencesp. 38
2.4.4 Extending to HSPp. 40
2.4.5 Introducing gapsp. 40
2.4.6 Algorithmp. 42
2.5 Exercisesp. 43
2.6 Bibliographic notesp. 45
3 Statistical Analysisp. 47
3.1 Hypothesis Testing for Sequence Homologyp. 47
3.1.1 Random generation of sequencesp. 48
3.1.2 Use of Z values for estimating the statistical significancep. 50
3.2 Statistical Distributionsp. 51
3.2.1 Poisson probability distributionp. 51
3.2.2 Extreme value distributionsp. 52
3.3 Theoretical Analysis of Statistical Significancep. 53
3.3.1 The P value has an extreme value distributionp. 55
3.3.2 Theoretical analysis for database searchp. 56
3.4 Probability Distributions for Gapped Alignmentsp. 57
3.5 Assessing and Comparing Programs for Database Searchp. 58
3.5.1 Sensitivity and specificityp. 59
3.5.2 Discrimination powerp. 60
3.5.3 Using more sequences as queriesp. 62
3.6 Exercisesp. 62
3.7 Bibliographic notesp. 64
4 Multiple Global Alignment and Phylogenetic Treesp. 65
4.1 Dynamic Programmingp. 65
4.1.1 SP score of multiple alignmentsp. 67
4.1.2 A pruning algorithm for the DP solutionp. 69
4.2 Multiple Alignments and Phylogenetic Treesp. 72
4.3 Phylogenyp. 74
4.3.1 The number of different tree topologiesp. 76
4.3.2 Molecular clock theoryp. 77
4.3.3 Additive and ultrametric treesp. 77
4.3.4 Different approaches for reconstructing phylogenetic treesp. 79
4.3.5 Distance-based constructionp. 81
4.3.6 Rooting of treesp. 85
4.3.7 Statistical test: bootstrappingp. 86
4.4 Progressive Alignmentp. 87
4.4.1 Aligning two subset alignmentsp. 88
4.4.2 Clusteringp. 90
4.4.3 Sequence weightsp. 93
4.4.4 CLUSTALp. 95
4.5 Other Approachesp. 97
4.6 Exercisesp. 97
4.7 Bibliographic notesp. 100
5 Scoring Matricesp. 101
5.1 Scoring Matrices Based on Physio-Chemical Propertiesp. 102
5.2 PAM Scoring Matricesp. 103
5.2.1 The evolutionary modelp. 104
5.2.2 Calculate substitution matrixp. 104
5.2.3 Matrices for general evolutionary timep. 107
5.2.4 Measuring sequence similarity by use of M[superscript tau]p. 109
5.2.5 Odds matricesp. 109
5.2.6 Scoring matrices (log-odds matrices)p. 111
5.2.7 Estimating the evolutionary distancep. 111
5.3 BLOSUm Scoring Matricesp. 113
5.3.1 Log-odds matrixp. 114
5.3.2 Developing scoring matrices for different evolutionary distancesp. 115
5.4 Comparing BLOSUM and PAM Matricesp. 117
5.5 Optimal Scoring Matricesp. 119
5.5.1 Analysis for one sequencep. 119
5.6 Exercisesp. 120
5.7 Bibliographic notesp. 122
6 Profilesp. 123
6.1 Constructing a Profilep. 124
6.1.1 Notationp. 126
6.1.2 Removing rows and columnsp. 127
6.1.3 Position weightsp. 127
6.1.4 Sequence weightsp. 129
6.1.5 Treating gapsp. 129
6.2 Searching Databases with Profilesp. 130
6.3 Iterated BLAST: PSI-BLASTp. 131
6.3.1 Making the multiple alignmentp. 132
6.3.2 Constructing the profilep. 132
6.4 HMM Profilep. 134
6.4.1 Definitions for an HMMp. 134
6.4.2 Constructing a profile HMM for a protein familyp. 135
6.4.3 Comparing a sequence with an HMMp. 137
6.4.4 Protein family databasesp. 137
6.5 Exercisesp. 137
6.6 Bibliographic notesp. 139
7 Sequence Patternsp. 141
7.1 The PROSITE Languagep. 142
7.2 Exact/Approximate Matchingp. 143
7.3 Defining Pattern Classes by Imposing Constraintsp. 144
7.4 Pattern Scoring: Information Theoryp. 145
7.4.1 Information theoryp. 145
7.4.2 Scoring patternsp. 147
7.5 Generalization and Specializationp. 148
7.6 Pattern Discovery: Introductionp. 148
7.7 Comparison-Based Methodsp. 151
7.7.1 Pivot-based methodsp. 152
7.7.2 Tree progressive methodsp. 153
7.8 Pattern-Driven Methods: Prattp. 154
7.8.1 The main procedurep. 155
7.8.2 Preprocessingp. 156
7.8.3 The pattern spacep. 156
7.8.4 Searchingp. 156
7.8.5 Ambiguous componentsp. 159
7.8.6 Specializationp. 160
7.8.7 Pattern scoringp. 160
7.9 Exercisesp. 160
7.10 Bibliographic notesp. 162
Part II Structure Analysis
8 Structures and Structure Descriptionsp. 165
8.1 Units of Structure Descriptionsp. 167
8.2 Coordinatesp. 168
8.3 Distance Matricesp. 168
8.4 Torsion Anglesp. 173
8.5 Coarse Level Descriptionp. 174
8.5.1 Line segments (sticks)p. 174
8.5.2 Ellipsoidp. 175
8.5.3 Helicesp. 176
8.5.4 Strands and sheetsp. 177
8.5.5 Topology of Protein Structure (TOPS)p. 177
8.6 Identifying the SSEsp. 178
8.6.1 Use of distance matricesp. 178
8.6.2 Define Secondary Structure of Proteins (DSSP)p. 179
8.7 Structure Comparisonp. 182
8.7.1 Structure descriptions for comparisonp. 183
8.7.2 Structure representationp. 185
8.8 Framework for Pairwise Structure Comparisonp. 187
8.9 Exercisesp. 189
8.10 Bibliographic notesp. 191
9 Superposition and Dynamic Programmingp. 193
9.1 Superpositionp. 193
9.1.1 Coordinate RMSDp. 193
9.1.2 Distance RMSDp. 195
9.1.3 Using RMSD as scoring of structure similaritiesp. 196
9.2 Alternating Superposition and Alignmentp. 196
9.3 Double Dynamic Programmingp. 199
9.3.1 Low-level scoring matricesp. 201
9.3.2 High-level scoring matrixp. 203
9.3.3 Iterated double dynamic programmingp. 203
9.4 Similarity of the Methodsp. 206
9.5 Exercisesp. 206
9.6 Bibliographic notesp. 210
10 Geometric Techniquesp. 211
10.1 Geometric Hashingp. 211
10.1.1 Two-dimensional geometric hashingp. 211
10.1.2 Geometric hashing for structure comparisonp. 216
10.1.3 Geometric hashing for SSE representationp. 219
10.1.4 Clusteringp. 220
10.2 Distance Matricesp. 221
10.2.1 Measuring the similarity of distance (sub)matricesp. 224
10.3 Exercisesp. 227
10.4 Bibliographic notesp. 228
11 Clustering: Combining Local Similaritiesp. 229
11.1 Compatibility and Consistencyp. 229
11.2 Searching for Seed Matchesp. 232
11.3 Consistencyp. 232
11.3.1 Test for consistencyp. 232
11.3.2 Overlapping clustersp. 234
11.4 Clustering Algorithmsp. 235
11.4.1 Linear clusteringp. 235
11.4.2 Hierarchical clusteringp. 236
11.5 Clustering by Use of Transformationsp. 237
11.5.1 Comparing transformationsp. 237
11.5.2 Calculating the new transformationp. 241
11.5.3 Algorithmp. 241
11.6 Clustering by Use of Relationsp. 243
11.6.1 How many relations to compare?p. 244
11.6.2 Geometric relationp. 244
11.6.3 Distance relationp. 245
11.6.4 Use of graph theoryp. 246
11.7 Refinementp. 248
11.8 Exercisesp. 248
11.9 Bibliographic notesp. 250
12 Significance and Assessment of Structure Comparisonsp. 253
12.1 Constructing Random Structure Modelsp. 253
12.1.1 Use of distance geometryp. 254
12.2 Use of Structure Databasesp. 255
12.2.1 Constructing nonredundant subsetsp. 255
12.2.2 Demarcation line for similarityp. 255
12.3 Reversing the Protein Chainp. 255
12.4 Randomized Alignment Modelsp. 257
12.5 Assessing Comparison and Scoring Methodsp. 257
12.6 Is RMSD Suitable for Scoring?p. 258
12.7 Scoring and Biological Significancep. 259
12.8 Exercisesp. 259
12.9 Bibliographic notesp. 260
13 Multiple Structure Comparisonp. 261
13.1 Multiple Superpositionp. 261
13.2 Progressive Structure Alignmentp. 263
13.2.1 Scoringp. 265
13.2.2 Construction of consensusp. 266
13.3 Finding a Common Core from a Multiple Alignmentp. 266
13.4 Discovering Common Coresp. 267
13.4.1 Finding the multiple seed matchesp. 268
13.4.2 Pairwise clusteringp. 269
13.4.3 Determining common coresp. 269
13.4.4 Scoring clustersp. 271
13.5 Local Structure Patternsp. 271
13.5.1 Local packing patternsp. 272
13.5.2 Discovering packing patternsp. 273
13.5.3 The approachp. 273
13.5.4 Scoring the packing motifsp. 276
13.6 Exercisesp. 276
13.7 Bibliographic notesp. 278
14 Protein Structure Classificationp. 279
14.1 Protein Domainsp. 279
14.2 An Ising Model for Domain Identificationp. 281
14.3 Domain Classesp. 283
14.3.1 Mainly-[alpha] domainsp. 283
14.3.2 Mainly-[beta] domainsp. 284
14.3.3 [alpha]-[beta] domainsp. 285
14.4 Foldsp. 285
14.5 Automatic Approaches to Classificationp. 285
14.6 Databases for Structure Classificationp. 286
14.7 FSSP-Dali Domain Dictionaryp. 286
14.8 CATHp. 288
14.8.1 Domainsp. 288
14.8.2 Classp. 288
14.8.3 Architecturep. 289
14.8.4 Topology (fold family)p. 289
14.8.5 Homologous superfamilyp. 290
14.8.6 Sequence familiesp. 290
14.8.7 The CATH classification procedurep. 290
14.9 Classification Based on Sticksp. 291
14.10 Exercisesp. 291
14.11 Bibliographic notesp. 292
Part III Sequence-Structureanalysis
15 Structure Prediction: Threadingp. 297
15.1 Protein Secondary Structure Predictionp. 298
15.1.1 Artificial neural networksp. 298
15.1.2 The PHD programp. 300
15.1.3 Accuracy in secondary structure predictionp. 301
15.2 Threadingp. 301
15.3 Methods Based on Sequence Alignmentp. 301
15.3.1 The 3D-1D matching methodp. 302
15.3.2 The FUGUE methodp. 303
15.4 Methods Using 3D Interactionsp. 303
15.4.1 Potentials of mean forcep. 305
15.4.2 Towards modelling methodsp. 307
15.5 Alignment Methodsp. 307
15.5.1 Frozen approximationp. 307
15.5.2 Double Dynamic Programmingp. 309
15.6 Multiple Sequence/Structure Threadingp. 310
15.6.1 Simple multiple sequence threadingp. 311
15.7 Combined Sequence/Threading Methodsp. 311
15.8 Assessment of Threading Methodsp. 311
15.8.1 Fold recognitionp. 312
15.8.2 Alignment accuracyp. 312
15.8.3 CASP and CAFASPp. 313
15.9 Bibliographic notesp. 313
Appendix A Basics in Mathematics, Probability and Algorithmsp. 315
A.1 Mathematical Formulae and Notationp. 315
A.2 Boolean Algebrap. 316
A.3 Set Theoryp. 316
A.4 Probabilityp. 317
A.4.1 Permutation and combinationp. 317
A.4.2 Probability distributionsp. 318
A.4.3 Expected valuep. 318
A.5 Tables, Vectors and Matricesp. 319
A.6 Algorithmic Languagep. 319
A.6.1 Alternativesp. 319
A.6.2 Loopsp. 320
A.7 Complexityp. 320
Appendix B Introduction to Molecular Biologyp. 323
B.1 The Cell and the Molecules of Life: DNA-RNA Proteinsp. 323
B.2 Chromosomes and Genesp. 326
B.3 The Central Dogma of Molecular Biologyp. 327
B.4 The Genetic Codep. 327
B.5 Protein Functionp. 329
B.5.1 The gene ontologyp. 330
B.6 Protein Structurep. 330
B.7 Evolutionp. 334
B.8 Insulin Examplep. 335
B.9 Bibliographic notesp. 336
Referencesp. 337
Indexp. 349