The analysis of gene expression data : methods and software

The development of technologies for high-throughput measurement of gene expression in biological systems is providing powerful new tools for investigating the transcriptome on a genomic scale, and across diverse biological systems and experimental designs. This technological transformation is generating an increasing demand for data analysis in biological investigations of gene expression. This book focuses on data analysis of gene expression microarrays. The goal is to provide guidance to practitioners in deciding which statistical approaches and packages may be indicated for their projects, in choosing among the various options provided by those packages, and in correctly interpreting the results. The book is a collection of chapters written by authors of statistical software for microarray data analysis. Each chapter describes the conceptual and methodological underpinning of data analysis tools as well as their software implementation, and will enable readers to both understand and implement an analysis approach. Methods touch on all aspects of statistical analysis of microarrays, from annotation and filtering to clustering and classification. All software packages described are free to academic users. The materials presented cover a range of software tools designed for varied audiences. Some chapters describe simple menu-driven software in a user-friendly fashion and are designed to be accessible to microarray data analysts without formal quantitative training. Most chapters are directed at microarray data analysts with master's-level training in computer science, biostatistics, or bioinformatics. A minority of more advanced chapters are intended for doctoral students and researchers.

Giovanni Parmigiani and Elizabeth S. Garrett and Rafael A. Irizarry and Scott L. Zeger
Robert Gentleman and Vincent Carey
Sandrine Dudoit and Jean Yee Hwa Yang
Rafael A. Irizarry and Laurent Gautier and Leslie M. Cope
Cheng Li and Wing Hung Wong
Jaak Vilo and Misha Kapushesky and Patrick Kemmeren and Ugis Sarkans and Alvis Brazma
Jae K. Lee and Michael O'Connell
Christopher M.L.S. Bouton and George Henry and Carlo Colantuoni and Jonathan Pevsner
Carlo Colantuoni and George Henry and Christopher M.L.S. Bouton and Scott L. Zeger and Jonathan Pevsner
Peter F. Lemkin and Gregory C. Thornwall and Jai Evans
Michael A. Newton and Christina Kendziorski
John D. Storey and Robert Tibshirani
Yi Lin and Samuel T. Nadler and Hong Lan and Alan D. Attie and Brian S. Yandell
Hao Wu and M. Kathleen Kerr and Xiangqin Cui and Gary A. Churchill
Kim-Anh Do and Bradley Broom and Sijin Wen
Elizabeth S. Garrett and Giovanni Parmigiani
Michael F. Ochs
Paola Sebastiani and Marco Ramoni and Isaac S. Kohane
Atul J. Butte and Isaac S. Kohane

Preface	p. v
Contributors	p. xvii
Color Insert
1 The Analysis of Gene Expression Data: An Overview of Methods and Software	p. 1
1.1 Measuring Gene Expression Using Microarrays	p. 1
1.1.1 Microarray Technologies	p. 1
1.1.2 Sources of Variation in Gene Expression Measurements Using Microarrays	p. 4
1.1.3 Phases of Microarray Data Analysis	p. 5
1.2 Design of Microarray Experiments	p. 7
1.2.1 Replication and Sample Size Considerations	p. 7
1.2.2 Design of Two-Channel Arrays	p. 9
1.3 Data Storage	p. 9
1.3.1 Databases	p. 9
1.3.2 Standards	p. 10
1.3.3 Statistical Analysis Languages	p. 11
1.4 Preprocessing	p. 12
1.4.1 Image Analysis	p. 12
1.4.2 Visualizations for Quality Control	p. 12
1.4.3 Background Subtraction	p. 13
1.4.4 Probe-level Analysis of Oligonucleotide Arrays	p. 14
1.4.5 Within-Array Normalization of cDNA Arrays	p. 15
1.4.6 Normalization Across Arrays	p. 15
1.5 Screening for Differentially Expressed Genes	p. 16
1.5.1 Estimation or Selection?	p. 16
1.5.2 One Problem or Many?	p. 17
1.5.3 Selection and False Discovery Rates	p. 18
1.5.4 Beyond Two Groups	p. 19
1.6 Challenges of Genome Biometry Analyses	p. 19
1.7 Visualization and Unsupervised Analyses	p. 21
1.7.1 Profile Visualization	p. 21
1.7.2 Why Clustering?	p. 22
1.7.3 Hierarchical Clustering	p. 23
1.7.4 k-Means Clustering and Self-Organizing Maps	p. 25
1.7.5 Model-Based Clustering	p. 26
1.7.6 Principal Components Analysis	p. 26
1.7.7 Multidimensional Scaling	p. 27
1.7.8 Identifying Novel Molecular Subclasses	p. 27
1.7.9 Time Series Analysis	p. 28
1.8 Prediction	p. 29
1.8.1 Prediction Tools	p. 29
1.8.2 Dimension Reduction	p. 30
1.8.3 Evaluation of Classifiers	p. 30
1.8.4 Regression-Based Approaches	p. 31
1.8.5 Classification Trees	p. 31
1.8.6 Probabilistic Model-Based Classification	p. 32
1.8.7 Discriminant Analysis	p. 33
1.8.8 Nearest-Neighbor Classifiers	p. 33
1.8.9 Support Vector Machines	p. 33
1.9 Free and Open-Source Software	p. 33
1.9.1 Whitehead Institute Tools	p. 34
1.9.2 Eisen Lab Tools	p. 34
1.9.3 TIGR Tools	p. 34
1.9.4 GeneX and CyberT	p. 35
1.9.5 Projects and NCBI	p. 35
1.9.6 BRB	p. 35
1.9.7 The OOML Library	p. 36
1.9.8 MatArray	p. 36
1.9.9 BASE	p. 36
1.10 Conclusion	p. 36
2 Visualization and Annotation of Genomic Experiments	p. 46
2.1 Introduction	p. 46
2.2 Motivations for Component-Based Software	p. 47
2.3 Formalism	p. 49
2.4 Bioconductor Software for Filtering, Exploring, and Interpreting Microarray Experiments	p. 50
2.4.1 Formal Data Structures and Methods for Multiple Microarrays	p. 50
2.4.2 Tools for Filtering Gene Expression Data: The Closure Concept	p. 54
2.4.3 Expression Density Diagnostics: High-Throughput Exploratory Data Analysis for Microarrays	p. 55
2.4.4 Annotation	p. 57
2.5 Visualization	p. 58
2.5.1 Chromosomes	p. 59
2.6 Applications	p. 64
2.6.1 A Case Study of Gene Filtering	p. 64
2.6.2 Application of Expression Density Diagnostics	p. 67
2.7 Conclusions	p. 70
3 Bioconductor R Packages for Exploratory Analysis and Normalization of cDNA Microarray Data	p. 73
3.1 Introduction	p. 73
3.1.1 Overview of Packages	p. 73
3.1.2 Two-Color cDNA Microarray Experiments	p. 75
3.2 Methods	p. 76
3.2.1 Standards for Microarray Data	p. 76
3.2.2 Object-Oriented Programming: Microarray Classes and Methods	p. 77
3.2.3 Diagnostic Plots	p. 78
3.2.4 Normalization Using Robust Local Regression	p. 79
3.3 Application: Swirl Microarray Experiment	p. 80
3.4 Software	p. 81
3.4.1 Package marrayClasses--Classes and Methods for cDNA Microarray Data	p. 81
3.4.2 Package marrayInput--Data Input for cDNA Microarrays	p. 89
3.4.3 Package marrayPlots--Diagnostic Plots for cDNA Microarray Data	p. 91
3.4.4 Package marrayNorm--Location and Scale Normalization for cDNA Microarray Data	p. 96
3.5 Discussion	p. 99
4 An R Package for Analyses of Affymetrix Oligonucleotide Arrays	p. 102
4.1 Introduction	p. 102
4.2 Methods	p. 103
4.2.1 Notation	p. 103
4.2.2 The CEL/CDF Convention	p. 104
4.2.3 Probe Pair Sets	p. 106
4.2.4 Probe-Level Objects	p. 107
4.2.5 Normalization	p. 108
4.2.6 Exploratory Data Analysis of Probe-Level Data	p. 111
4.3 Application	p. 113
4.3.1 Expression Measures	p. 113
4.4 Software	p. 115
4.4.1 A Case Study	p. 115
4.4.2 Extending the Package	p. 118
4.5 Conclusion	p. 118
5 DNA-Chip Analyzer (dChip)	p. 120
5.1 Introduction	p. 120
5.2 Methods	p. 121
5.2.1 Normalization of Arrays Based on an "Invariant Set"	p. 121
5.2.2 Model-Based Analysis of Oligonucleotide Arrays	p. 122
5.2.3 Confidence Interval for Fold Change	p. 122
5.2.4 Pooling Replicate Arrays Considering Measurement Accuracy	p. 124
5.3 Software and Applications	p. 125
5.3.1 Reading in Array Data Files	p. 125
5.3.2 Viewing an Array Image	p. 127
5.3.3 Normalizing Arrays	p. 129
5.3.4 Viewing PM/MM Data	p. 129
5.3.5 Calculating Model-Based Expression Indexes	p. 131
5.3.6 Filter Genes	p. 132
5.3.7 Hierarchical Clustering	p. 133
5.3.8 Comparing Samples	p. 135
5.3.9 Mapping Genes to Chromosomes	p. 137
5.3.10 Sample Classification by Linear Discriminant Analysis	p. 138
5.4 Discussion	p. 139
6 Expression Profiler	p. 142
6.1 Introduction	p. 142
6.2 EPCLUST	p. 143
6.2.1 EPCLUST: Data Import	p. 143
6.2.2 EPCLUST: Data Filtering	p. 144
6.2.3 EPCLUST: Data Annotation	p. 146
6.2.4 EPCLUST: Data Environment	p. 147
6.2.5 EPCLUST: Data Analysis	p. 148
6.3 URLMAP: Cross-Linking of the Analysis Results Between the Tools and Databases	p. 151
6.4 EP:GO GeneOntology Browser	p. 152
6.5 EP:PPI: Comparison of Protein Pairs and Expression	p. 153
6.6 Pattern Discovery, Pattern Matching, and Visualization Tools	p. 154
6.7 An Example of the Data Analysis and Visualizations Performed by the Tools in Expression Profiler	p. 154
6.8 Integration of Expression Profiler with Public Microarray Databases	p. 159
6.9 Conclusions	p. 160
7 An S-PLUS Library for the Analysis and Visualization of Differential Expression	p. 163
7.1 Introduction	p. 163
7.2 Assessment of Differential Expression	p. 164
7.2.1 Local Pooled Error	p. 165
7.2.2 Tests for Differential Expression	p. 169
7.2.3 Cluster Analysis and Visualization	p. 171
7.3 Analysis of Melanoma Expression	p. 174
7.3.1 Tests for Differential Expression	p. 175
7.3.2 Cluster Analysis and Visualization	p. 178
7.3.3 Annotation	p. 180
7.4 Discussion	p. 181
8 DRAGON and DRAGON View: Methods for the Annotation, Analysis, and Visualization of Large-Scale Gene Expression Data	p. 185
8.1 Introduction	p. 185
8.2 System and Methods	p. 189
8.2.1 Overview of DRAGON	p. 189
8.2.2 DRAGON's Hardware, Software, and Database Architecture	p. 190
8.2.3 Cross-Referencing Information in DRAGON	p. 192
8.2.4 The DRAGON Search and Annotate Tools	p. 193
8.2.5 The DRAGON View Data Visualization Tools	p. 196
8.2.6 DRAGON Gram: A Novel Visualization Tool	p. 198
8.3 Implementation	p. 199
8.4 Discussion and Conclusion	p. 204
9 SNOMAD: Biologist-Friendly Web Tools for the Standardization and NOrmalization of Microarray Data	p. 210
9.1 Introduction	p. 210
9.2 Methods and Application	p. 212
9.2.1 Overview of Experimental and Data Analysis Procedures	p. 212
9.2.2 Background Subtraction	p. 214
9.2.3 Global Mean Normalization	p. 214
9.2.4 Standard Data Transformation and Visualization Methods	p. 215
9.2.5 Local Mean Normalization Across Element Signal Intensity	p. 217
9.2.6 Local Variance Correction Across Element Signal Intensity	p. 219
9.2.7 Local Mean Normalization Across the Microarray Surface	p. 223
9.3 Software	p. 225
9.4 Discussion	p. 226
10 Microarray Analysis Using the MicroArray Explorer	p. 229
10.1 Introduction	p. 229
10.1.1 Need for the Methodology	p. 230
10.1.2 Basic Ideas Behind the Approach	p. 231
10.2 Methods--Statistical and Informatics Basis	p. 232
10.2.1 Analysis Paradigm	p. 235
10.2.2 Particular Analysis Methods	p. 238
10.2.3 Data Conversion	p. 238
10.3 Software	p. 239
10.3.1 System Design--Software Implementation	p. 244
10.3.2 How to Download the Software	p. 247
10.3.3 Strengths and Weaknesses of the Approach	p. 248
10.4 Applications	p. 249
10.5 Discussion	p. 251
11 Parametric Empirical Bayes Methods for Microarrays	p. 254
11.1 Introduction	p. 254
11.2 EB Methods	p. 256
11.2.1 Canonical EB Example	p. 256
11.2.2 General Model Structure: Two Conditions	p. 256
11.2.3 Multiple Conditions	p. 258
11.2.4 The Gamma-Gamma and Lognormal-Normal Models	p. 259
11.2.5 Model Fitting	p. 260
11.3 Software	p. 261
11.4 Application	p. 263
11.5 Discussion	p. 269
12 SAM Thresholding and False Discovery Rates for Detecting Differential Gene Expression in DNA Microarrays	p. 272
12.1 Introduction	p. 272
12.2 Methods and Applications	p. 273
12.2.1 Multiple Hypothesis Testing	p. 273
12.2.2 An Application	p. 275
12.2.3 Forming the Test Statistics	p. 276
12.2.4 Calculating the Null Distribution	p. 277
12.2.5 The SAM Thresholding Procedure	p. 278
12.2.6 Estimating False-Discovery Rates	p. 280
12.3 Software	p. 283
12.3.1 Obtaining the Software	p. 283
12.3.2 Data Formats	p. 283
12.3.3 Response Format	p. 284
12.3.4 Example Input Data File for an Unpaired Problem	p. 285
12.3.5 Block Permutations	p. 285
12.3.6 Normalization of Experiments	p. 285
12.3.7 Handling Missing Data	p. 287
12.3.8 Running SAM	p. 287
12.3.9 Format of the Significant Gene List	p. 288
12.4 Discussion	p. 289
13 Adaptive Gene Picking with Microarray Data: Detecting Important Low Abundance Signals	p. 291
13.1 Introduction	p. 291
13.2 Methods	p. 292
13.2.1 Background Subtraction	p. 292
13.2.2 Transformation to Approximate Normality	p. 293
13.2.3 Differential Expression Across Conditions	p. 295
13.2.4 Robust Center and Spread	p. 297
13.2.5 Formal Evaluation of Significant Differential Expression	p. 299
13.2.6 Simulation Studies	p. 301
13.2.7 Comparison of Methods with E. coli Data	p. 304
13.3 Software	p. 304
13.4 Application	p. 306
13.4.1 Diabetes and Obesity Studies	p. 306
13.4.2 Software Example	p. 308
14 MAANOVA: A Software Package for the Analysis of Spotted cDNA Microarray Experiments	p. 313
14.1 Introduction	p. 313
14.2 Methods	p. 314
14.2.1 Data Acquisition	p. 315
14.2.2 ANOVA Models for Microarray Data	p. 315
14.2.3 Experimental Design for Microarrays	p. 317
14.2.4 Data Transformations	p. 321
14.2.5 Algorithms for Computing ANOVA Estimates	p. 322
14.2.6 Statistical Inference	p. 323
14.2.7 Cluster Analysis	p. 327
14.3 Software	p. 328
14.3.1 Availability	p. 328
14.3.2 Functionality	p. 329
14.4 Data Analysis with MAANOVA	p. 334
14.5 Discussion	p. 339
15 GeneClust	p. 342
15.1 Introduction	p. 342
15.2 Methods	p. 343
15.2.1 Algorithm	p. 343
15.2.2 Choice of Cluster Size via the Gap Statistic	p. 344
15.2.3 Supervised Gene Shaving for Class Discrimination	p. 346
15.3 Software	p. 347
15.3.1 The GeneShaving Package	p. 347
15.3.2 GeneClust: A Faster Implementation of Gene Shaving	p. 352
15.4 Applications	p. 354
15.4.1 The Alon Colon Dataset	p. 354
15.4.2 The NCI60 Dataset	p. 356
15.5 Discussion	p. 358
16 POE: Statistical Methods for Qualitative Analysis of Gene Expression	p. 362
16.1 Introduction	p. 362
16.2 Methodology	p. 364
16.2.1 Mixture Model for Gene Expression	p. 364
16.2.2 Useful Representations of the Results	p. 366
16.2.3 Bayesian Hierarchical Model Formulation	p. 367
16.2.4 Restrictions to Remove Ambiguity in the Case of Only Two Components	p. 368
16.2.5 Mining for Subsets of Genes	p. 368
16.2.6 Creating Molecular Profiles	p. 370
16.3 R Software Extension: POE	p. 371
16.3.1 An Example of Using POE on Simulated Data	p. 371
16.3.2 Estimating Posterior Probability of Expression Using poe.fit	p. 372
16.3.3 Visualization Tools	p. 374
16.3.4 Gene-Mining Functions	p. 377
16.3.5 Molecular Profiling Tool	p. 379
16.4 Results of POE Applied to Lung Cancer Data	p. 381
16.5 Discussion and Future Work	p. 384
17 Bayesian Decomposition	p. 388
17.1 Introduction	p. 388
17.1.1 Role of Signaling and Metabolic Pathways	p. 388
17.1.2 Gene Expression Microarrays	p. 389
17.2 Methods	p. 390
17.2.1 Matrix Decomposition	p. 390
17.2.2 Markov Chain Monte Carlo	p. 391
17.2.3 Bayesian Framework	p. 392
17.2.4 The Prior Distribution	p. 393
17.2.5 Summary Statistics	p. 395
17.3 Software	p. 396
17.3.1 Implementation	p. 396
17.3.2 Files and Installation	p. 396
17.3.3 Issues in the Application of Bayesian Decomposition	p. 397
17.4 Application of Bayesian Decomposition to Yeast Cell Cycle Data	p. 398
17.4.1 Preparation of the Data	p. 398
17.4.2 Running the Program	p. 399
17.4.3 Visualizing the Output	p. 400
17.4.4 Interpretation	p. 402
17.4.5 Advantages of Bayesian Decomposition	p. 403
17.5 Discussion	p. 403
18 Bayesian Clustering of Gene Expression Dynamics	p. 409
18.1 Introduction	p. 409
18.2 Methods	p. 411
18.2.1 Modeling Time	p. 412
18.2.2 Probabilistic Scoring Metric	p. 413
18.2.3 Heuristic Search	p. 415
18.2.4 Statistical Diagnostics	p. 416
18.3 Software	p. 417
18.3.1 Screen 0: Welcome Screen	p. 417
18.3.2 Screen 1: Getting Started	p. 418
18.3.3 Screen 2: Analysis	p. 418
18.3.4 Screen 3: Cluster Model	p. 419
18.3.5 Screen 4: Pack and Go!	p. 419
18.4 Application	p. 420
18.4.1 Analysis	p. 420
18.4.2 Statistical Diagnostics	p. 421
18.4.3 Understanding the Model	p. 421
18.5 Conclusions	p. 424
19 Relevance Networks: A First Step Toward Finding Genetic Regulatory Networks Within Microarray Data	p. 428
19.1 Introduction	p. 428
19.1.1 Advantages of Relevance Networks	p. 429
19.2 Methodology	p. 431
19.2.1 Formal Definition of Relevance Networks	p. 431
19.2.2 Finding Regulatory Networks in Phenotypic Data	p. 432
19.2.3 Using Entropy and Mutual Information to Evaluate Gene-Gene Associations	p. 434
19.3 Applications	p. 437
19.3.1 Finding Pharmacogenomic Regulatory Networks	p. 437
19.3.2 Setting the Threshold	p. 439
19.4 Software	p. 440
Index	p. 447
