Fundamentals of data mining in genomics and proteomics

This book presents state-of-the-art analytical methods from statistics and data mining for the analysis of high-throughput data from genomics and proteomics. It focuses on concepts and applications, detailing the underlying principles, merits and limitations of each key analytical technique.

Daniel Berrar, Martin Granzow and Werner Dubitzky; Kathleen F. Kerr; Benjamin M. Bolstad; Kevin R. Coombes, Keith A. Baggerly and Jeffrey S. Morris; Xiaochun Li and Jaroslaw Harezlak; Joaquin Dopazo; Milos Hauskrecht, Richard Pelikan, Michal Valko and James Lyons-Weiler; Richard Simon; Peter Johansson and Markus Ringner; Carlos Rodriguez-Caso and Ricard V. Sole; Oliver Bembom, Maya L. Petersen and Mark J. van der Laan; Robert Hoffmann

1 Introduction to Genomic and Proteomic Data Analysis
1.1 Introduction
1.2 A Short Overview of Wet Lab Techniques
1.2.1 Transcriptomics Techniques in a Nutshell
1.2.2 Proteomics Techniques in a Nutshell
1.3 A Few Words on Terminology
1.4 Study Design
1.5 Data Mining
1.5.1 Mapping Scientific Questions to Analytical Tasks
1.5.2 Visual Inspection
1.5.3 Data Pre-Processing
1.5.3.1 Handling of Missing Values
1.5.3.2 Data Transformations
1.5.4 The Problem of Dimensionality
1.5.4.1 Mapping to Lower Dimensions
1.5.4.2 Feature Selection and Significance Analysis
1.5.4.3 Test Statistics for Discriminatory Features
1.5.4.4 Multiple Hypotheses Testing
1.5.4.5 Random Permutation Tests
1.5.5 Predictive Model Construction
1.5.5.1 Basic Measures of Performance
1.5.5.2 Training, Validating, and Testing
1.5.5.3 Data Resampling Strategies
1.5.6 Statistical Significance Tests for Comparing Models
1.6 Result Post-Processing
1.6.1 Statistical Validation
1.6.2 Epistemological Validation
1.6.3 Biological Validation
1.7 Conclusions
References
2 Design Principles for Microarray Investigations
2.1 Introduction
2.2 The "Pre-Planning" Stage
2.2.1 Goal 1: Unsupervised Learning
2.2.2 Goal 2: Supervised Learning
2.2.3 Goal 3: Class Comparison
2.3 Statistical Design Principles, Applied to Microarrays
2.3.1 Replication
2.3.2 Blocking
2.3.3 Randomization
2.4 Case Study
2.5 Conclusions
References
3 Pre-Processing DNA Microarray Data
3.1 Introduction
3.1.1 Affymetrix GeneChips
3.1.2 Two-Color Microarrays
3.2 Basic Concepts
3.2.1 Pre-Processing Affymetrix GeneChip Data
3.2.2 Pre-Processing Two-Color Microarray Data
3.3 Advantages and Disadvantages
3.3.1 Affymetrix GeneChip Data
3.3.1.1 Advantages
3.3.1.2 Disadvantages
3.3.2 Two-Color Microarrays
3.3.2.1 Advantages
3.3.2.2 Disadvantages
3.4 Caveats and Pitfalls
3.5 Alternatives
3.5.1 Affymetrix GeneChip Data
3.5.2 Two-Color Microarrays
3.6 Case Study
3.6.1 Pre-Processing an Affymetrix GeneChip Data Set
3.6.2 Pre-Processing a Two-Channel Microarray Data Set
3.7 Lessons Learned
3.8 List of Tools and Resources
3.9 Conclusions
3.10 Mathematical Details
3.10.1 RMA Background Correction Equation
3.10.2 Quantile Normalization
3.10.3 RMA Model
3.10.4 Quality Assessment Statistics
3.10.5 Computation of M and A Values for Two-Channel Microarray Data
3.10.6 Print-Tip Loess Normalization
References
4 Pre-Processing Mass Spectrometry Data
4.1 Introduction
4.2 Basic Concepts
4.3 Advantages and Disadvantages
4.4 Caveats and Pitfalls
4.5 Alternatives
4.6 Case Study: Experimental and Simulated Data Sets for Comparing Pre-Processing Methods
4.7 Lessons Learned
4.8 List of Tools and Resources
4.9 Conclusions
References
5 Visualization in Genomics and Proteomics
5.1 Introduction
5.2 Basic Concepts
5.2.1 Metric Scaling
5.2.2 Nonmetric Scaling
5.3 Advantages and Disadvantages
5.4 Caveats and Pitfalls
5.5 Alternatives
5.6 Case Study: MDS on Mass Spectrometry Data
5.7 Lessons Learned
5.8 List of Tools and Resources
5.9 Conclusions
References
6 Clustering - Class Discovery in the Post-Genomic Era
6.1 Introduction
6.2 Basic Concepts
6.2.1 Distance Metrics
6.2.2 Clustering Methods
6.2.2.1 Aggregative Hierarchical Clustering
6.2.2.2 k-Means
6.2.2.3 Self-Organizing Maps
6.2.2.4 Self-Organizing Tree Algorithm
6.2.2.5 Model-Based Clustering
6.2.3 Biclustering
6.2.4 Validation Methods
6.2.5 Functional Annotation
6.3 Advantages and Disadvantages
6.4 Caveats and Pitfalls
6.4.1 On Distances
6.4.2 On Clustering Methods
6.5 Alternatives
6.6 Case Study
6.7 Lessons Learned
6.8 List of Tools and Resources
6.8.1 General Resources
6.8.1.1 Multiple Purpose Tools (Including Clustering)
6.8.2 Clustering Tools
6.8.3 Biclustering Tools
6.8.4 Time Series
6.8.5 Public-Domain Statistical Packages and Other Tools
6.8.6 Functional Analysis Tools
6.9 Conclusions
References
7 Feature Selection and Dimensionality Reduction in Genomics and Proteomics
7.1 Introduction
7.2 Basic Concepts
7.2.1 Filter Methods
7.2.1.1 Criteria Based on Hypothesis Testing
7.2.1.2 Permutation Tests
7.2.1.3 Choosing Features Based on the Score
7.2.1.4 Feature Set Selection and Controlling False Positives
7.2.1.5 Correlation Filtering
7.2.2 Wrapper Methods
7.2.3 Embedded Methods
7.2.3.1 Regularization/Shrinkage Methods
7.2.3.2 Support Vector Machines
7.2.4 Feature Construction
7.2.4.1 Clustering
7.2.4.2 Clustering Algorithms
7.2.4.3 Probabilistic (Soft) Clustering
7.2.4.4 Clustering Features
7.2.4.5 Principal Component Analysis
7.2.4.6 Discriminative Projections
7.3 Advantages and Disadvantages
7.4 Case Study: Pancreatic Cancer
7.4.1 Data and Pre-Processing
7.4.2 Filter Methods
7.4.2.1 Basic Filter Methods
7.4.2.2 Controlling False Positive Selections
7.4.2.3 Correlation Filters
7.4.3 Wrapper Methods
7.4.4 Embedded Methods
7.4.5 Feature Construction Methods
7.4.6 Summary of Analysis Results and Recommendations
7.5 Conclusions
7.6 Mathematical Details
References
8 Resampling Strategies for Model Assessment and Selection
8.1 Introduction
8.2 Basic Concepts
8.2.1 Resubstitution Estimate of Prediction Error
8.2.2 Split-Sample Estimate of Prediction Error
8.3 Resampling Methods
8.3.1 Leave-One-Out Cross-Validation
8.3.2 k-fold Cross-Validation
8.3.3 Monte Carlo Cross-Validation
8.3.4 Bootstrap Resampling
8.3.4.1 The .632 Bootstrap
8.3.4.2 The .632+ Bootstrap
8.4 Resampling for Model Selection and Optimizing Tuning Parameters
8.4.1 Estimating Statistical Significance of Classification Error Rates
8.4.2 Comparison to Classifiers Based on Standard Prognostic Variables
8.5 Comparison of Resampling Strategies
8.6 Tools and Resources
8.7 Conclusions
References
9 Classification of Genomic and Proteomic Data Using Support Vector Machines
9.1 Introduction
9.2 Basic Concepts
9.2.1 Support Vector Machines
9.2.2 Feature Selection
9.2.3 Evaluating Predictive Performance
9.3 Advantages and Disadvantages
9.3.1 Advantages
9.3.2 Disadvantages
9.4 Caveats and Pitfalls
9.5 Alternatives
9.6 Case Study: Classification of Mass Spectral Serum Profiles Using Support Vector Machines
9.6.1 Data Set
9.6.2 Analysis Strategies
9.6.2.1 Strategy A: SVM without Feature Selection
9.6.2.2 Strategy B: SVM with Feature Selection
9.6.2.3 Strategy C: SVM Optimized Using Test Samples Performance
9.6.2.4 Strategy D: SVM with Feature Selection Using Test Samples
9.6.3 Results
9.7 Lessons Learned
9.8 List of Tools and Resources
9.9 Conclusions
9.10 Mathematical Details
References
10 Networks in Cell Biology
10.1 Introduction
10.1.1 Protein Networks
10.1.2 Metabolic Networks
10.1.3 Transcriptional Regulation Maps
10.1.4 Signal Transduction Pathways
10.2 Basic Concepts
10.2.1 Graph Definition
10.2.2 Node Attributes
10.2.3 Graph Attributes
10.3 Caveats and Pitfalls
10.4 Case Study: Topological Analysis of the Human Transcription Factor Interaction Network
10.5 Lessons Learned
10.6 List of Tools and Resources
10.7 Conclusions
10.8 Mathematical Details
References
11 Identifying Important Explanatory Variables for Time-Varying Outcomes
11.1 Introduction
11.2 Basic Concepts
11.3 Advantages and Disadvantages
11.3.1 Advantages
11.3.2 Disadvantages
11.4 Caveats and Pitfalls
11.5 Alternatives
11.6 Case Study: HIV Drug Resistance Mutations
11.7 Lessons Learned
11.8 List of Tools and Resources
11.9 Conclusions
References
12 Text Mining in Genomics and Proteomics
12.1 Introduction
12.1.1 Text Mining
12.1.2 Interactive Literature Exploration
12.2 Basic Concepts
12.2.1 Information Retrieval
12.2.2 Entity Recognition
12.2.3 Information Extraction
12.2.4 Biomedical Text Resources
12.2.5 Assessment and Comparison of Text Mining Methods
12.3 Caveats and Pitfalls
12.3.1 Entity Recognition
12.3.2 Full Text
12.3.3 Distribution of Information
12.3.4 The Impossible
12.3.5 Overall Performance
12.4 Alternatives
12.4.1 Functional Coherence Analysis of Gene Groups
12.4.2 Co-Occurrence Networks
12.4.3 Superimposition of Experimental Data to the Literature Network
12.4.4 Gene Ontologies
12.5 Case Study
12.6 Lessons Learned
12.7 List of Tools and Resources
12.8 Conclusion
12.9 Mathematical Details
References
Index
