Title:
Fundamentals of data mining in genomics and proteomics
Publication Information:
New York, NY : Springer-Verlag, 2007
Physical Description:
xxii, 281 p. : ill., digital ; 25 cm.
ISBN:
9780387475080
9780387475097
General Note:
Also available in an online version
Item Barcode | Call Number | Material Type | Item Category 1 |
---|---|---|---|
30000010138257 | QH452.7 F86 2007 | Open Access Book | Book |
Summary
This book presents state-of-the-art analytical methods from statistics and data mining for the analysis of high-throughput data from genomics and proteomics. It focuses on concepts and applications, detailing the underlying principles, merits, and limitations of the key analytical techniques.
Table of Contents
1 Introduction to Genomic and Proteomic Data Analysis | p. 1 |
1.1 Introduction | p. 1 |
1.2 A Short Overview of Wet Lab Techniques | p. 3 |
1.2.1 Transcriptomics Techniques in a Nutshell | p. 3 |
1.2.2 Proteomics Techniques in a Nutshell | p. 5 |
1.3 A Few Words on Terminology | p. 6 |
1.4 Study Design | p. 7 |
1.5 Data Mining | p. 8 |
1.5.1 Mapping Scientific Questions to Analytical Tasks | p. 9 |
1.5.2 Visual Inspection | p. 11 |
1.5.3 Data Pre-Processing | p. 13 |
1.5.3.1 Handling of Missing Values | p. 13 |
1.5.3.2 Data Transformations | p. 14 |
1.5.4 The Problem of Dimensionality | p. 15 |
1.5.4.1 Mapping to Lower Dimensions | p. 15 |
1.5.4.2 Feature Selection and Significance Analysis | p. 16 |
1.5.4.3 Test Statistics for Discriminatory Features | p. 17 |
1.5.4.4 Multiple Hypotheses Testing | p. 19 |
1.5.4.5 Random Permutation Tests | p. 21 |
1.5.5 Predictive Model Construction | p. 22 |
1.5.5.1 Basic Measures of Performance | p. 24 |
1.5.5.2 Training, Validating, and Testing | p. 25 |
1.5.5.3 Data Resampling Strategies | p. 27 |
1.5.6 Statistical Significance Tests for Comparing Models | p. 29 |
1.6 Result Post-Processing | p. 31 |
1.6.1 Statistical Validation | p. 31 |
1.6.2 Epistemological Validation | p. 32 |
1.6.3 Biological Validation | p. 32 |
1.7 Conclusions | p. 32 |
References | p. 33 |
2 Design Principles for Microarray Investigations | p. 39 |
2.1 Introduction | p. 39 |
2.2 The "Pre-Planning" Stage | p. 39 |
2.2.1 Goal 1: Unsupervised Learning | p. 40 |
2.2.2 Goal 2: Supervised Learning | p. 41 |
2.2.3 Goal 3: Class Comparison | p. 41 |
2.3 Statistical Design Principles, Applied to Microarrays | p. 42 |
2.3.1 Replication | p. 42 |
2.3.2 Blocking | p. 43 |
2.3.3 Randomization | p. 46 |
2.4 Case Study | p. 47 |
2.5 Conclusions | p. 47 |
References | p. 48 |
3 Pre-Processing DNA Microarray Data | p. 51 |
3.1 Introduction | p. 51 |
3.1.1 Affymetrix GeneChips | p. 53 |
3.1.2 Two-Color Microarrays | p. 55 |
3.2 Basic Concepts | p. 55 |
3.2.1 Pre-Processing Affymetrix GeneChip Data | p. 56 |
3.2.2 Pre-Processing Two-Color Microarray Data | p. 59 |
3.3 Advantages and Disadvantages | p. 62 |
3.3.1 Affymetrix GeneChip Data | p. 62 |
3.3.1.1 Advantages | p. 62 |
3.3.1.2 Disadvantages | p. 62 |
3.3.2 Two-Color Microarrays | p. 62 |
3.3.2.1 Advantages | p. 62 |
3.3.2.2 Disadvantages | p. 63 |
3.4 Caveats and Pitfalls | p. 63 |
3.5 Alternatives | p. 63 |
3.5.1 Affymetrix GeneChip Data | p. 63 |
3.5.2 Two-Color Microarrays | p. 64 |
3.6 Case Study | p. 64 |
3.6.1 Pre-Processing an Affymetrix GeneChip Data Set | p. 64 |
3.6.2 Pre-Processing a Two-Channel Microarray Data Set | p. 69 |
3.7 Lessons Learned | p. 73 |
3.8 List of Tools and Resources | p. 74 |
3.9 Conclusions | p. 74 |
3.10 Mathematical Details | p. 74 |
3.10.1 RMA Background Correction Equation | p. 74 |
3.10.2 Quantile Normalization | p. 75 |
3.10.3 RMA Model | p. 75 |
3.10.4 Quality Assessment Statistics | p. 75 |
3.10.5 Computation of M and A Values for Two-Channel Microarray Data | p. 76 |
3.10.6 Print-Tip Loess Normalization | p. 76 |
References | p. 76 |
4 Pre-Processing Mass Spectrometry Data | p. 79 |
4.1 Introduction | p. 79 |
4.2 Basic Concepts | p. 82 |
4.3 Advantages and Disadvantages | p. 83 |
4.4 Caveats and Pitfalls | p. 87 |
4.5 Alternatives | p. 89 |
4.6 Case Study: Experimental and Simulated Data Sets for Comparing Pre-Processing Methods | p. 92 |
4.7 Lessons Learned | p. 98 |
4.8 List of Tools and Resources | p. 98 |
4.9 Conclusions | p. 99 |
References | p. 99 |
5 Visualization in Genomics and Proteomics | p. 103 |
5.1 Introduction | p. 103 |
5.2 Basic Concepts | p. 105 |
5.2.1 Metric Scaling | p. 107 |
5.2.2 Nonmetric Scaling | p. 109 |
5.3 Advantages and Disadvantages | p. 109 |
5.4 Caveats and Pitfalls | p. 110 |
5.5 Alternatives | p. 112 |
5.6 Case Study: MDS on Mass Spectrometry Data | p. 113 |
5.7 Lessons Learned | p. 118 |
5.8 List of Tools and Resources | p. 119 |
5.9 Conclusions | p. 120 |
References | p. 121 |
6 Clustering - Class Discovery in the Post-Genomic Era | p. 123 |
6.1 Introduction | p. 123 |
6.2 Basic Concepts | p. 126 |
6.2.1 Distance Metrics | p. 126 |
6.2.2 Clustering Methods | p. 127 |
6.2.2.1 Aggregative Hierarchical Clustering | p. 128 |
6.2.2.2 k-Means | p. 129 |
6.2.2.3 Self-Organizing Maps | p. 130 |
6.2.2.4 Self-Organizing Tree Algorithm | p. 130 |
6.2.2.5 Model-Based Clustering | p. 130 |
6.2.3 Biclustering | p. 131 |
6.2.4 Validation Methods | p. 131 |
6.2.5 Functional Annotation | p. 132 |
6.3 Advantages and Disadvantages | p. 132 |
6.4 Caveats and Pitfalls | p. 134 |
6.4.1 On Distances | p. 135 |
6.4.2 On Clustering Methods | p. 135 |
6.5 Alternatives | p. 136 |
6.6 Case Study | p. 137 |
6.7 Lessons Learned | p. 139 |
6.8 List of Tools and Resources | p. 140 |
6.8.1 General Resources | p. 140 |
6.8.1.1 Multiple Purpose Tools (Including Clustering) | p. 140 |
6.8.2 Clustering Tools | p. 141 |
6.8.3 Biclustering Tools | p. 141 |
6.8.4 Time Series | p. 141 |
6.8.5 Public-Domain Statistical Packages and Other Tools | p. 141 |
6.8.6 Functional Analysis Tools | p. 142 |
6.9 Conclusions | p. 142 |
References | p. 143 |
7 Feature Selection and Dimensionality Reduction in Genomics and Proteomics | p. 149 |
7.1 Introduction | p. 149 |
7.2 Basic Concepts | p. 151 |
7.2.1 Filter Methods | p. 151 |
7.2.1.1 Criteria Based on Hypothesis Testing | p. 151 |
7.2.1.2 Permutation Tests | p. 152 |
7.2.1.3 Choosing Features Based on the Score | p. 153 |
7.2.1.4 Feature Set Selection and Controlling False Positives | p. 153 |
7.2.1.5 Correlation Filtering | p. 154 |
7.2.2 Wrapper Methods | p. 155 |
7.2.3 Embedded Methods | p. 155 |
7.2.3.1 Regularization/Shrinkage Methods | p. 155 |
7.2.3.2 Support Vector Machines | p. 156 |
7.2.4 Feature Construction | p. 156 |
7.2.4.1 Clustering | p. 156 |
7.2.4.2 Clustering Algorithms | p. 158 |
7.2.4.3 Probabilistic (Soft) Clustering | p. 158 |
7.2.4.4 Clustering Features | p. 158 |
7.2.4.5 Principal Component Analysis | p. 159 |
7.2.4.6 Discriminative Projections | p. 159 |
7.3 Advantages and Disadvantages | p. 160 |
7.4 Case Study: Pancreatic Cancer | p. 161 |
7.4.1 Data and Pre-Processing | p. 161 |
7.4.2 Filter Methods | p. 162 |
7.4.2.1 Basic Filter Methods | p. 162 |
7.4.2.2 Controlling False Positive Selections | p. 162 |
7.4.2.3 Correlation Filters | p. 164 |
7.4.3 Wrapper Methods | p. 165 |
7.4.4 Embedded Methods | p. 166 |
7.4.5 Feature Construction Methods | p. 167 |
7.4.6 Summary of Analysis Results and Recommendations | p. 168 |
7.5 Conclusions | p. 169 |
7.6 Mathematical Details | p. 169 |
References | p. 170 |
8 Resampling Strategies for Model Assessment and Selection | p. 173 |
8.1 Introduction | p. 173 |
8.2 Basic Concepts | p. 174 |
8.2.1 Resubstitution Estimate of Prediction Error | p. 174 |
8.2.2 Split-Sample Estimate of Prediction Error | p. 175 |
8.3 Resampling Methods | p. 176 |
8.3.1 Leave-One-Out Cross-Validation | p. 177 |
8.3.2 k-fold Cross-Validation | p. 178 |
8.3.3 Monte Carlo Cross-Validation | p. 178 |
8.3.4 Bootstrap Resampling | p. 179 |
8.3.4.1 The .632 Bootstrap | p. 179 |
8.3.4.2 The .632+ Bootstrap | p. 180 |
8.4 Resampling for Model Selection and Optimizing Tuning Parameters | p. 181 |
8.4.1 Estimating Statistical Significance of Classification Error Rates | p. 183 |
8.4.2 Comparison to Classifiers Based on Standard Prognostic Variables | p. 183 |
8.5 Comparison of Resampling Strategies | p. 184 |
8.6 Tools and Resources | p. 184 |
8.7 Conclusions | p. 185 |
References | p. 186 |
9 Classification of Genomic and Proteomic Data Using Support Vector Machines | p. 187 |
9.1 Introduction | p. 187 |
9.2 Basic Concepts | p. 187 |
9.2.1 Support Vector Machines | p. 188 |
9.2.2 Feature Selection | p. 190 |
9.2.3 Evaluating Predictive Performance | p. 191 |
9.3 Advantages and Disadvantages | p. 192 |
9.3.1 Advantages | p. 192 |
9.3.2 Disadvantages | p. 192 |
9.4 Caveats and Pitfalls | p. 192 |
9.5 Alternatives | p. 193 |
9.6 Case Study: Classification of Mass Spectral Serum Profiles Using Support Vector Machines | p. 193 |
9.6.1 Data Set | p. 193 |
9.6.2 Analysis Strategies | p. 194 |
9.6.2.1 Strategy A: SVM without Feature Selection | p. 196 |
9.6.2.2 Strategy B: SVM with Feature Selection | p. 196 |
9.6.2.3 Strategy C: SVM Optimized Using Test Samples Performance | p. 196 |
9.6.2.4 Strategy D: SVM with Feature Selection Using Test Samples | p. 196 |
9.6.3 Results | p. 196 |
9.7 Lessons Learned | p. 197 |
9.8 List of Tools and Resources | p. 197 |
9.9 Conclusions | p. 198 |
9.10 Mathematical Details | p. 198 |
References | p. 200 |
10 Networks in Cell Biology | p. 203 |
10.1 Introduction | p. 203 |
10.1.1 Protein Networks | p. 204 |
10.1.2 Metabolic Networks | p. 205 |
10.1.3 Transcriptional Regulation Maps | p. 205 |
10.1.4 Signal Transduction Pathways | p. 206 |
10.2 Basic Concepts | p. 206 |
10.2.1 Graph Definition | p. 206 |
10.2.2 Node Attributes | p. 207 |
10.2.3 Graph Attributes | p. 208 |
10.3 Caveats and Pitfalls | p. 212 |
10.4 Case Study: Topological Analysis of the Human Transcription Factor Interaction Network | p. 213 |
10.5 Lessons Learned | p. 218 |
10.6 List of Tools and Resources | p. 219 |
10.7 Conclusions | p. 220 |
10.8 Mathematical Details | p. 220 |
References | p. 221 |
11 Identifying Important Explanatory Variables for Time-Varying Outcomes | p. 227 |
11.1 Introduction | p. 227 |
11.2 Basic Concepts | p. 229 |
11.3 Advantages and Disadvantages | p. 233 |
11.3.1 Advantages | p. 233 |
11.3.2 Disadvantages | p. 234 |
11.4 Caveats and Pitfalls | p. 235 |
11.5 Alternatives | p. 237 |
11.6 Case Study: HIV Drug Resistance Mutations | p. 239 |
11.7 Lessons Learned | p. 245 |
11.8 List of Tools and Resources | p. 246 |
11.9 Conclusions | p. 247 |
References | p. 248 |
12 Text Mining in Genomics and Proteomics | p. 251 |
12.1 Introduction | p. 251 |
12.1.1 Text Mining | p. 251 |
12.1.2 Interactive Literature Exploration | p. 253 |
12.2 Basic Concepts | p. 253 |
12.2.1 Information Retrieval | p. 253 |
12.2.2 Entity Recognition | p. 254 |
12.2.3 Information Extraction | p. 254 |
12.2.4 Biomedical Text Resources | p. 255 |
12.2.5 Assessment and Comparison of Text Mining Methods | p. 256 |
12.3 Caveats and Pitfalls | p. 256 |
12.3.1 Entity Recognition | p. 256 |
12.3.2 Full Text | p. 257 |
12.3.3 Distribution of Information | p. 257 |
12.3.4 The Impossible | p. 258 |
12.3.5 Overall Performance | p. 258 |
12.4 Alternatives | p. 259 |
12.4.1 Functional Coherence Analysis of Gene Groups | p. 259 |
12.4.2 Co-Occurrence Networks | p. 260 |
12.4.3 Superimposition of Experimental Data to the Literature Network | p. 260 |
12.4.4 Gene Ontologies | p. 261 |
12.5 Case Study | p. 261 |
12.6 Lessons Learned | p. 265 |
12.7 List of Tools and Resources | p. 266 |
12.8 Conclusion | p. 266 |
12.9 Mathematical Details | p. 270 |
References | p. 270 |
Index | p. 275 |