Bioinformatics: The Machine Learning Approach

An unprecedented wealth of data is being generated by genome sequencing projects and other experimental efforts to determine the structure and function of biological molecules. The demands and opportunities for interpreting these data are expanding rapidly. Bioinformatics is the development and application of computer methods for management, analysis, interpretation, and prediction, as well as for the design of experiments. Machine learning approaches (e.g., neural networks, hidden Markov models, and belief networks) are ideally suited for areas where there is a lot of data but little theory, which is the situation in molecular biology. The goal in machine learning is to extract useful information from a body of data by building good probabilistic models, and to automate the process as much as possible.

In this book Pierre Baldi and Søren Brunak present the key machine learning approaches and apply them to the computational problems encountered in the analysis of biological data. The book is aimed both at biologists and biochemists who need to understand new data-driven algorithms and at those with a primary background in physics, mathematics, statistics, or computer science who need to know more about applications in molecular biology.

This new second edition contains expanded coverage of probabilistic graphical models and of the applications of neural networks, as well as a new chapter on microarrays and gene expression. The entire text has been extensively revised.

Author Notes

Pierre Baldi is Professor of Information and Computer Science and of Biological Chemistry (College of Medicine) and Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine.

Søren Brunak is Professor and Director of the Center for Biological Sequence Analysis at the Technical University of Denmark.

Series Foreword

Preface

1 Introduction

1.1 Biological Data in Digital Symbol Sequences

1.2 Genomes--Diversity, Size, and Structure

1.3 Proteins and Proteomes

1.4 On the Information Content of Biological Sequences

1.5 Prediction of Molecular Function and Structure

2 Machine Learning Foundations: The Probabilistic Framework

2.1 Introduction: Bayesian Modeling

2.2 The Cox-Jaynes Axioms

2.3 Bayesian Inference and Induction

2.4 Model Structures: Graphical Models and Other Tricks

2.5 Summary

3 Probabilistic Modeling and Inference: Examples

3.1 The Simplest Sequence Models

3.2 Statistical Mechanics

4 Machine Learning Algorithms

4.1 Introduction

4.2 Dynamic Programming

4.3 Gradient Descent

4.4 EM/GEM Algorithms

4.5 Markov Chain Monte Carlo Methods

4.6 Simulated Annealing

4.7 Evolutionary and Genetic Algorithms

4.8 Learning Algorithms: Miscellaneous Aspects

5 Neural Networks: The Theory

5.1 Introduction

5.2 Universal Approximation Properties

5.3 Priors and Likelihoods

5.4 Learning Algorithms: Backpropagation

6 Neural Networks: Applications

6.1 Sequence Encoding and Output Interpretation

6.2 Prediction of Protein Secondary Structure

6.3 Prediction of Signal Peptides and Their Cleavage Sites

6.4 Applications for DNA and RNA Nucleotide Sequences

7 Hidden Markov Models: The Theory

7.1 Introduction

7.2 Prior Information and Initialization

7.3 Likelihood and Basic Algorithms

7.4 Learning Algorithms

7.5 Applications of HMMs: General Aspects

8 Hidden Markov Models: Applications

8.1 Protein Applications

8.2 DNA and RNA Applications

8.3 Conclusion: Advantages and Limitations of HMMs

9 Hybrid Systems: Hidden Markov Models and Neural Networks

9.1 Introduction to Hybrid Models

9.2 The Single-Model Case

9.3 The Multiple-Model Case

9.4 Simulation Results

9.5 Summary

10 Probabilistic Models of Evolution: Phylogenetic Trees

10.1 Introduction to Probabilistic Models of Evolution

10.2 Substitution Probabilities and Evolutionary Rates

10.3 Rates of Evolution

10.4 Data Likelihood

10.5 Optimal Trees and Learning

10.6 Parsimony

10.7 Extensions

11 Stochastic Grammars and Linguistics

11.1 Introduction to Formal Grammars

11.2 Formal Grammars and the Chomsky Hierarchy

11.3 Applications of Grammars to Biological Sequences

11.4 Prior Information and Initialization

11.5 Likelihood

11.6 Learning Algorithms

11.7 Applications of SCFGs

11.8 Experiments

11.9 Future Directions

12 Internet Resources and Public Databases

12.1 A Rapidly Changing Set of Resources

12.2 Databases over Databases and Tools

12.3 Databases over Databases

12.4 Databases

12.5 Sequence Similarity Searches

12.6 Alignment

12.7 Selected Prediction Servers

12.8 Molecular Biology Software Links

12.9 Ph.D. Courses over the Internet

12.10 HMM/NN Simulator

A Statistics

A.1 Decision Theory and Loss Functions

A.2 Quadratic Loss Functions

A.3 The Bias/Variance Trade-off

A.4 Combining Estimators

A.5 Error Bars

A.6 Sufficient Statistics

A.7 Exponential Family

A.8 Gaussian Process Models

A.9 Variational Methods

B Information Theory, Entropy, and Relative Entropy

B.1 Entropy

B.2 Relative Entropy

B.3 Mutual Information

B.4 Jensen's Inequality

B.5 Maximum Entropy

B.6 Minimum Relative Entropy

C Probabilistic Graphical Models

C.1 Notation and Preliminaries

C.2 The Undirected Case: Markov Random Fields

C.3 The Directed Case: Bayesian Networks

D HMM Technicalities, Scaling, Periodic Architectures, State Functions, and Dirichlet Mixtures

D.1 Scaling

D.2 Periodic Architectures

D.3 State Functions: Bendability

D.4 Dirichlet Mixtures

E List of Main Symbols and Abbreviations

References

Index
