Title:
Fundamentals of data mining in genomics and proteomics
Publication Information:
New York, NY : Springer-Verlag, 2007
Physical Description:
xxii, 281 p. : ill., digital ; 25 cm.
ISBN:
9780387475080

9780387475097
General Note:
Also available in an online version.
Electronic Access:
Full Text

Available:
Item Barcode:
30000010138257
Call Number:
QH452.7 F86 2007
Material Type:
Open Access Book
Item Category 1:
Book

Summary


This book presents state-of-the-art analytical methods from statistics and data mining for high-throughput data from genomics and proteomics. Focusing on concepts and applications, it presents the key analytical techniques for such data and details their underlying principles, merits, and limitations.


Table of Contents

Daniel Berrar, Martin Granzow, and Werner Dubitzky; Kathleen F. Kerr; Benjamin M. Bolstad; Kevin R. Coombes, Keith A. Baggerly, and Jeffrey S. Morris; Xiaochun Li and Jaroslaw Harezlak; Joaquin Dopazo; Milos Hauskrecht, Richard Pelikan, Michal Valko, and James Lyons-Weiler; Richard Simon; Peter Johansson and Markus Ringner; Carlos Rodriguez-Caso and Ricard V. Sole; Oliver Bembom, Maya L. Petersen, and Mark J. van der Laan; Robert Hoffmann
1 Introduction to Genomic and Proteomic Data Analysis p. 1
1.1 Introduction p. 1
1.2 A Short Overview of Wet Lab Techniques p. 3
1.2.1 Transcriptomics Techniques in a Nutshell p. 3
1.2.2 Proteomics Techniques in a Nutshell p. 5
1.3 A Few Words on Terminology p. 6
1.4 Study Design p. 7
1.5 Data Mining p. 8
1.5.1 Mapping Scientific Questions to Analytical Tasks p. 9
1.5.2 Visual Inspection p. 11
1.5.3 Data Pre-Processing p. 13
1.5.3.1 Handling of Missing Values p. 13
1.5.3.2 Data Transformations p. 14
1.5.4 The Problem of Dimensionality p. 15
1.5.4.1 Mapping to Lower Dimensions p. 15
1.5.4.2 Feature Selection and Significance Analysis p. 16
1.5.4.3 Test Statistics for Discriminatory Features p. 17
1.5.4.4 Multiple Hypotheses Testing p. 19
1.5.4.5 Random Permutation Tests p. 21
1.5.5 Predictive Model Construction p. 22
1.5.5.1 Basic Measures of Performance p. 24
1.5.5.2 Training, Validating, and Testing p. 25
1.5.5.3 Data Resampling Strategies p. 27
1.5.6 Statistical Significance Tests for Comparing Models p. 29
1.6 Result Post-Processing p. 31
1.6.1 Statistical Validation p. 31
1.6.2 Epistemological Validation p. 32
1.6.3 Biological Validation p. 32
1.7 Conclusions p. 32
References p. 33
2 Design Principles for Microarray Investigations p. 39
2.1 Introduction p. 39
2.2 The "Pre-Planning" Stage p. 39
2.2.1 Goal 1: Unsupervised Learning p. 40
2.2.2 Goal 2: Supervised Learning p. 41
2.2.3 Goal 3: Class Comparison p. 41
2.3 Statistical Design Principles, Applied to Microarrays p. 42
2.3.1 Replication p. 42
2.3.2 Blocking p. 43
2.3.3 Randomization p. 46
2.4 Case Study p. 47
2.5 Conclusions p. 47
References p. 48
3 Pre-Processing DNA Microarray Data p. 51
3.1 Introduction p. 51
3.1.1 Affymetrix GeneChips p. 53
3.1.2 Two-Color Microarrays p. 55
3.2 Basic Concepts p. 55
3.2.1 Pre-Processing Affymetrix GeneChip Data p. 56
3.2.2 Pre-Processing Two-Color Microarray Data p. 59
3.3 Advantages and Disadvantages p. 62
3.3.1 Affymetrix GeneChip Data p. 62
3.3.1.1 Advantages p. 62
3.3.1.2 Disadvantages p. 62
3.3.2 Two-Color Microarrays p. 62
3.3.2.1 Advantages p. 62
3.3.2.2 Disadvantages p. 63
3.4 Caveats and Pitfalls p. 63
3.5 Alternatives p. 63
3.5.1 Affymetrix GeneChip Data p. 63
3.5.2 Two-Color Microarrays p. 64
3.6 Case Study p. 64
3.6.1 Pre-Processing an Affymetrix GeneChip Data Set p. 64
3.6.2 Pre-Processing a Two-Channel Microarray Data Set p. 69
3.7 Lessons Learned p. 73
3.8 List of Tools and Resources p. 74
3.9 Conclusions p. 74
3.10 Mathematical Details p. 74
3.10.1 RMA Background Correction Equation p. 74
3.10.2 Quantile Normalization p. 75
3.10.3 RMA Model p. 75
3.10.4 Quality Assessment Statistics p. 75
3.10.5 Computation of M and A Values for Two-Channel Microarray Data p. 76
3.10.6 Print-Tip Loess Normalization p. 76
References p. 76
4 Pre-Processing Mass Spectrometry Data p. 79
4.1 Introduction p. 79
4.2 Basic Concepts p. 82
4.3 Advantages and Disadvantages p. 83
4.4 Caveats and Pitfalls p. 87
4.5 Alternatives p. 89
4.6 Case Study: Experimental and Simulated Data Sets for Comparing Pre-Processing Methods p. 92
4.7 Lessons Learned p. 98
4.8 List of Tools and Resources p. 98
4.9 Conclusions p. 99
References p. 99
5 Visualization in Genomics and Proteomics p. 103
5.1 Introduction p. 103
5.2 Basic Concepts p. 105
5.2.1 Metric Scaling p. 107
5.2.2 Nonmetric Scaling p. 109
5.3 Advantages and Disadvantages p. 109
5.4 Caveats and Pitfalls p. 110
5.5 Alternatives p. 112
5.6 Case Study: MDS on Mass Spectrometry Data p. 113
5.7 Lessons Learned p. 118
5.8 List of Tools and Resources p. 119
5.9 Conclusions p. 120
References p. 121
6 Clustering - Class Discovery in the Post-Genomic Era p. 123
6.1 Introduction p. 123
6.2 Basic Concepts p. 126
6.2.1 Distance Metrics p. 126
6.2.2 Clustering Methods p. 127
6.2.2.1 Aggregative Hierarchical Clustering p. 128
6.2.2.2 k-Means p. 129
6.2.2.3 Self-Organizing Maps p. 130
6.2.2.4 Self-Organizing Tree Algorithm p. 130
6.2.2.5 Model-Based Clustering p. 130
6.2.3 Biclustering p. 131
6.2.4 Validation Methods p. 131
6.2.5 Functional Annotation p. 132
6.3 Advantages and Disadvantages p. 132
6.4 Caveats and Pitfalls p. 134
6.4.1 On Distances p. 135
6.4.2 On Clustering Methods p. 135
6.5 Alternatives p. 136
6.6 Case Study p. 137
6.7 Lessons Learned p. 139
6.8 List of Tools and Resources p. 140
6.8.1 General Resources p. 140
6.8.1.1 Multiple Purpose Tools (Including Clustering) p. 140
6.8.2 Clustering Tools p. 141
6.8.3 Biclustering Tools p. 141
6.8.4 Time Series p. 141
6.8.5 Public-Domain Statistical Packages and Other Tools p. 141
6.8.6 Functional Analysis Tools p. 142
6.9 Conclusions p. 142
References p. 143
7 Feature Selection and Dimensionality Reduction in Genomics and Proteomics p. 149
7.1 Introduction p. 149
7.2 Basic Concepts p. 151
7.2.1 Filter Methods p. 151
7.2.1.1 Criteria Based on Hypothesis Testing p. 151
7.2.1.2 Permutation Tests p. 152
7.2.1.3 Choosing Features Based on the Score p. 153
7.2.1.4 Feature Set Selection and Controlling False Positives p. 153
7.2.1.5 Correlation Filtering p. 154
7.2.2 Wrapper Methods p. 155
7.2.3 Embedded Methods p. 155
7.2.3.1 Regularization/Shrinkage Methods p. 155
7.2.3.2 Support Vector Machines p. 156
7.2.4 Feature Construction p. 156
7.2.4.1 Clustering p. 156
7.2.4.2 Clustering Algorithms p. 158
7.2.4.3 Probabilistic (Soft) Clustering p. 158
7.2.4.4 Clustering Features p. 158
7.2.4.5 Principal Component Analysis p. 159
7.2.4.6 Discriminative Projections p. 159
7.3 Advantages and Disadvantages p. 160
7.4 Case Study: Pancreatic Cancer p. 161
7.4.1 Data and Pre-Processing p. 161
7.4.2 Filter Methods p. 162
7.4.2.1 Basic Filter Methods p. 162
7.4.2.2 Controlling False Positive Selections p. 162
7.4.2.3 Correlation Filters p. 164
7.4.3 Wrapper Methods p. 165
7.4.4 Embedded Methods p. 166
7.4.5 Feature Construction Methods p. 167
7.4.6 Summary of Analysis Results and Recommendations p. 168
7.5 Conclusions p. 169
7.6 Mathematical Details p. 169
References p. 170
8 Resampling Strategies for Model Assessment and Selection p. 173
8.1 Introduction p. 173
8.2 Basic Concepts p. 174
8.2.1 Resubstitution Estimate of Prediction Error p. 174
8.2.2 Split-Sample Estimate of Prediction Error p. 175
8.3 Resampling Methods p. 176
8.3.1 Leave-One-Out Cross-Validation p. 177
8.3.2 k-fold Cross-Validation p. 178
8.3.3 Monte Carlo Cross-Validation p. 178
8.3.4 Bootstrap Resampling p. 179
8.3.4.1 The .632 Bootstrap p. 179
8.3.4.2 The .632+ Bootstrap p. 180
8.4 Resampling for Model Selection and Optimizing Tuning Parameters p. 181
8.4.1 Estimating Statistical Significance of Classification Error Rates p. 183
8.4.2 Comparison to Classifiers Based on Standard Prognostic Variables p. 183
8.5 Comparison of Resampling Strategies p. 184
8.6 Tools and Resources p. 184
8.7 Conclusions p. 185
References p. 186
9 Classification of Genomic and Proteomic Data Using Support Vector Machines p. 187
9.1 Introduction p. 187
9.2 Basic Concepts p. 187
9.2.1 Support Vector Machines p. 188
9.2.2 Feature Selection p. 190
9.2.3 Evaluating Predictive Performance p. 191
9.3 Advantages and Disadvantages p. 192
9.3.1 Advantages p. 192
9.3.2 Disadvantages p. 192
9.4 Caveats and Pitfalls p. 192
9.5 Alternatives p. 193
9.6 Case Study: Classification of Mass Spectral Serum Profiles Using Support Vector Machines p. 193
9.6.1 Data Set p. 193
9.6.2 Analysis Strategies p. 194
9.6.2.1 Strategy A: SVM without Feature Selection p. 196
9.6.2.2 Strategy B: SVM with Feature Selection p. 196
9.6.2.3 Strategy C: SVM Optimized Using Test Samples Performance p. 196
9.6.2.4 Strategy D: SVM with Feature Selection Using Test Samples p. 196
9.6.3 Results p. 196
9.7 Lessons Learned p. 197
9.8 List of Tools and Resources p. 197
9.9 Conclusions p. 198
9.10 Mathematical Details p. 198
References p. 200
10 Networks in Cell Biology p. 203
10.1 Introduction p. 203
10.1.1 Protein Networks p. 204
10.1.2 Metabolic Networks p. 205
10.1.3 Transcriptional Regulation Maps p. 205
10.1.4 Signal Transduction Pathways p. 206
10.2 Basic Concepts p. 206
10.2.1 Graph Definition p. 206
10.2.2 Node Attributes p. 207
10.2.3 Graph Attributes p. 208
10.3 Caveats and Pitfalls p. 212
10.4 Case Study: Topological Analysis of the Human Transcription Factor Interaction Network p. 213
10.5 Lessons Learned p. 218
10.6 List of Tools and Resources p. 219
10.7 Conclusions p. 220
10.8 Mathematical Details p. 220
References p. 221
11 Identifying Important Explanatory Variables for Time-Varying Outcomes p. 227
11.1 Introduction p. 227
11.2 Basic Concepts p. 229
11.3 Advantages and Disadvantages p. 233
11.3.1 Advantages p. 233
11.3.2 Disadvantages p. 234
11.4 Caveats and Pitfalls p. 235
11.5 Alternatives p. 237
11.6 Case Study: HIV Drug Resistance Mutations p. 239
11.7 Lessons Learned p. 245
11.8 List of Tools and Resources p. 246
11.9 Conclusions p. 247
References p. 248
12 Text Mining in Genomics and Proteomics p. 251
12.1 Introduction p. 251
12.1.1 Text Mining p. 251
12.1.2 Interactive Literature Exploration p. 253
12.2 Basic Concepts p. 253
12.2.1 Information Retrieval p. 253
12.2.2 Entity Recognition p. 254
12.2.3 Information Extraction p. 254
12.2.4 Biomedical Text Resources p. 255
12.2.5 Assessment and Comparison of Text Mining Methods p. 256
12.3 Caveats and Pitfalls p. 256
12.3.1 Entity Recognition p. 256
12.3.2 Full Text p. 257
12.3.3 Distribution of Information p. 257
12.3.4 The Impossible p. 258
12.3.5 Overall Performance p. 258
12.4 Alternatives p. 259
12.4.1 Functional Coherence Analysis of Gene Groups p. 259
12.4.2 Co-Occurrence Networks p. 260
12.4.3 Superimposition of Experimental Data to the Literature Network p. 260
12.4.4 Gene Ontologies p. 261
12.5 Case Study p. 261
12.6 Lessons Learned p. 265
12.7 List of Tools and Resources p. 266
12.8 Conclusion p. 266
12.9 Mathematical Details p. 270
References p. 270
Index p. 275