| Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status |
|---|---|---|---|---|---|
| | 30000010161065 | P98.3 A26 2008 | Open Access Book | Book | |
Summary
The rapid advancement in the theoretical understanding of statistical and machine learning methods for semisupervised learning has made it difficult for nonspecialists to keep up to date in the field. Providing a broad, accessible treatment of the theory as well as linguistic applications, Semisupervised Learning for Computational Linguistics offers self-contained coverage of semisupervised methods that includes background material on supervised and unsupervised learning.
The book presents a brief history of semisupervised learning and its place in the spectrum of learning methods before moving on to discuss well-known natural language processing methods, such as self-training and co-training. It then centers on machine learning techniques, including the boundary-oriented methods of perceptrons, boosting, support vector machines (SVMs), and the null-category noise model. In addition, the book covers clustering, the expectation-maximization (EM) algorithm, related generative methods, and agreement methods. It concludes with the graph-based method of label propagation as well as a detailed discussion of spectral methods.
Taking an intuitive approach to the material, this lucid book facilitates the application of semisupervised learning methods to natural language processing and provides the framework and motivation for a more systematic study of machine learning.
Author Notes
Abney, Steven
Table of Contents
1 Introduction | p. 1 |
1.1 A brief history | p. 1 |
1.1.1 Probabilistic methods in computational linguistics | p. 1 |
1.1.2 Supervised and unsupervised training | p. 2 |
1.1.3 Semisupervised learning | p. 3 |
1.2 Semisupervised learning | p. 4 |
1.2.1 Major varieties of learning problem | p. 4 |
1.2.2 Motivation | p. 6 |
1.2.3 Evaluation | p. 7 |
1.2.4 Active learning | p. 8 |
1.3 Organization and assumptions | p. 8 |
1.3.1 Leading ideas | p. 8 |
1.3.2 Mathematical background | p. 10 |
1.3.3 Notation | p. 11 |
2 Self-training and Co-training | p. 13 |
2.1 Classification | p. 13 |
2.1.1 The standard setting | p. 13 |
2.1.2 Features and rules | p. 14 |
2.1.3 Decision lists | p. 16 |
2.2 Self-training | p. 18 |
2.2.1 The algorithm | p. 19 |
2.2.2 Parameters and variants | p. 20 |
2.2.3 Evaluation | p. 23 |
2.2.4 Symmetry of features and instances | p. 25 |
2.2.5 Related algorithms | p. 27 |
2.3 Co-Training | p. 28 |
3 Applications of Self-Training and Co-Training | p. 31 |
3.1 Part-of-speech tagging | p. 31 |
3.2 Information extraction | p. 33 |
3.3 Parsing | p. 35 |
3.4 Word senses | p. 36 |
3.4.1 WordNet | p. 36 |
3.4.2 Word-sense disambiguation | p. 38 |
3.4.3 Taxonomic inference | p. 40 |
4 Classification | p. 43 |
4.1 Two simple classifiers | p. 43 |
4.1.1 Naive Bayes | p. 43 |
4.1.2 k-nearest-neighbor classifier | p. 45 |
4.2 Abstract setting | p. 48 |
4.2.1 Function approximation | p. 48 |
4.2.2 Defining success | p. 50 |
4.2.3 Fit and simplicity | p. 52 |
4.3 Evaluating detectors and classifiers that abstain | p. 53 |
4.3.1 Confidence-rated classifiers | p. 53 |
4.3.2 Measures for detection | p. 54 |
4.3.3 Idealized performance curves | p. 57 |
4.3.4 The multiclass case | p. 59 |
4.4 Binary classifiers and ECOC | p. 62 |
5 Mathematics for Boundary-Oriented Methods | p. 67 |
5.1 Linear separators | p. 67 |
5.1.1 Representing a hyperplane | p. 67 |
5.1.2 Eliminating the threshold | p. 69 |
5.1.3 The point-normal form | p. 70 |
5.1.4 Naive Bayes decision boundary | p. 72 |
5.2 The gradient | p. 74 |
5.2.1 Graphs and domains | p. 74 |
5.2.2 Convexity | p. 76 |
5.2.3 Differentiation of vector and matrix expressions | p. 79 |
5.2.4 An example: linear regression | p. 81 |
5.3 Constrained optimization | p. 83 |
5.3.1 Optimization | p. 83 |
5.3.2 Equality constraints | p. 84 |
5.3.3 Inequality constraints | p. 87 |
5.3.4 The Wolfe dual | p. 91 |
6 Boundary-Oriented Methods | p. 95 |
6.1 The perceptron | p. 97 |
6.1.1 The algorithm | p. 97 |
6.1.2 An example | p. 99 |
6.1.3 Convergence | p. 100 |
6.1.4 The perceptron algorithm as gradient descent | p. 101 |
6.2 Game self-teaching | p. 103 |
6.3 Boosting | p. 105 |
6.3.1 Abstention | p. 110 |
6.3.2 Semisupervised boosting | p. 111 |
6.3.3 Co-boosting | p. 113 |
6.4 Support Vector Machines (SVMs) | p. 114 |
6.4.1 The margin | p. 114 |
6.4.2 Maximizing the margin | p. 116 |
6.4.3 The nonseparable case | p. 119 |
6.4.4 Slack in the separable case | p. 121 |
6.4.5 Multiple slack points | p. 123 |
6.4.6 Transductive SVMs | p. 125 |
6.4.7 Training a transductive SVM | p. 127 |
6.5 Null-category noise model | p. 129 |
7 Clustering | p. 131 |
7.1 Cluster and label | p. 131 |
7.2 Clustering concepts | p. 132 |
7.2.1 Objective | p. 132 |
7.2.2 Distance and similarity | p. 133 |
7.2.3 Graphs | p. 136 |
7.3 Hierarchical clustering | p. 137 |
7.4 Self-training revisited | p. 139 |
7.4.1 k-means clustering | p. 139 |
7.4.2 Pseudo relevance feedback | p. 140 |
7.5 Graph mincut | p. 143 |
7.6 Label propagation | p. 146 |
7.6.1 Clustering by propagation | p. 146 |
7.6.2 Self-training as propagation | p. 147 |
7.6.3 Co-training as propagation | p. 150 |
7.7 Bibliographic notes | p. 152 |
8 Generative Models | p. 153 |
8.1 Gaussian mixtures | p. 153 |
8.1.1 Definition and geometric interpretation | p. 153 |
8.1.2 The linear discriminant decision boundary | p. 156 |
8.1.3 Decision-directed approximation | p. 159 |
8.1.4 McLachlan's algorithm | p. 162 |
8.2 The EM algorithm | p. 163 |
8.2.1 Maximizing likelihood | p. 163 |
8.2.2 Relative frequency estimation | p. 164 |
8.2.3 Divergence | p. 166 |
8.2.4 The EM algorithm | p. 169 |
9 Agreement Constraints | p. 175 |
9.1 Co-training | p. 175 |
9.1.1 The conditional independence assumption | p. 176 |
9.1.2 The power of conditional independence | p. 178 |
9.2 Agreement-based self-teaching | p. 182 |
9.3 Random fields | p. 184 |
9.3.1 Applied to self-training and co-training | p. 184 |
9.3.2 Gibbs sampling | p. 186 |
9.3.3 Markov chains and random walks | p. 187 |
9.4 Bibliographic notes | p. 192 |
10 Propagation Methods | p. 193 |
10.1 Label propagation | p. 194 |
10.2 Random walks | p. 196 |
10.3 Harmonic functions | p. 198 |
10.4 Fluids | p. 203 |
10.4.1 Flow | p. 203 |
10.4.2 Pressure | p. 205 |
10.4.3 Conservation of energy | p. 209 |
10.4.4 Thomson's principle | p. 210 |
10.5 Computing the solution | p. 213 |
10.6 Graph mincuts revisited | p. 215 |
10.7 Bibliographic notes | p. 220 |
11 Mathematics for Spectral Methods | p. 221 |
11.1 Some basic concepts | p. 221 |
11.1.1 The norm of a vector | p. 221 |
11.1.2 Matrices as linear operators | p. 222 |
11.1.3 The column space | p. 222 |
11.2 Eigenvalues and eigenvectors | p. 224 |
11.2.1 Definition of eigenvalues and eigenvectors | p. 224 |
11.2.2 Diagonalization | p. 225 |
11.2.3 Orthogonal diagonalization | p. 226 |
11.3 Eigenvalues and the scaling effects of a matrix | p. 227 |
11.3.1 Matrix norms | p. 227 |
11.3.2 The Rayleigh quotient | p. 228 |
11.3.3 The 2 × 2 case | p. 230 |
11.3.4 The general case | p. 232 |
11.3.5 The Courant-Fischer minimax theorem | p. 234 |
11.4 Bibliographic notes | p. 236 |
12 Spectral Methods | p. 237 |
12.1 Simple harmonic motion | p. 237 |
12.1.1 Harmonics | p. 237 |
12.1.2 Mixtures of harmonics | p. 239 |
12.1.3 An oscillating particle | p. 241 |
12.1.4 A vibrating string | p. 243 |
12.2 Spectra of matrices and graphs | p. 251 |
12.2.1 The spectrum of a matrix | p. 252 |
12.2.2 Relating matrices and graphs | p. 253 |
12.2.3 The Laplacian matrix and graph spectrum | p. 256 |
12.3 Spectral clustering | p. 257 |
12.3.1 The second smallest eigenvector of the Laplacian | p. 257 |
12.3.2 The cut size and the Laplacian | p. 259 |
12.3.3 Approximating cut size | p. 260 |
12.3.4 Minimizing cut size | p. 262 |
12.3.5 Ratiocut | p. 263 |
12.4 Spectral methods for semisupervised learning | p. 265 |
12.4.1 Harmonics and harmonic functions | p. 265 |
12.4.2 Eigenvalues and energy | p. 267 |
12.4.3 The Laplacian and random fields | p. 268 |
12.4.4 Harmonic functions and the Laplacian | p. 270 |
12.4.5 Using the Laplacian for regularization | p. 272 |
12.4.6 Transduction to induction | p. 274 |
12.5 Bibliographic notes | p. 275 |
Bibliography | p. 277 |
Index | p. 301 |