Available:

Item Barcode | Call Number | Material Type | Item Category | Status |
---|---|---|---|---|
30000010319237 | P98.5.S83 M36 1999 | Open Access Book | Book | |
On Order
Summary
Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.
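As a flavor of the statistical methods the book covers, here is a minimal sketch (illustrative only, not taken from the book) of maximum-likelihood estimation for a bigram language model, the kind of n-gram estimator discussed in chapter 6:

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w2 | w1) by maximum likelihood: count(w1, w2) / count(w1).

    `tokens` is a plain list of word strings; no smoothing is applied,
    so unseen bigrams simply have no entry (probability zero).
    """
    # Count each word in a "history" position (all but the last token).
    unigrams = Counter(tokens[:-1])
    # Count adjacent word pairs.
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)
# "the" occurs twice as a history, once followed by "cat",
# so P(cat | the) = 1/2.
print(probs[("the", "cat")])  # → 0.5
```

The zero probability assigned to unseen bigrams is exactly the sparse-data problem that motivates the smoothing and combined estimators (Laplace's law, held-out estimation, Good-Turing, backing-off) treated in chapter 6.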
Author Notes
Christopher D. Manning is Lecturer in the Department of Linguistics at the University of Sydney, Australia. Hinrich Schütze is on the Research Staff at the Xerox Palo Alto Research Center.
Table of Contents
List of Tables | p. xv |
List of Figures | p. xxi |
Table of Notations | p. xxv |
Preface | p. xxix |
Road Map | p. xxxv |
I Preliminaries | p. 1 |
1 Introduction | p. 3 |
1.1 Rationalist and Empiricist Approaches to Language | p. 4 |
1.2 Scientific Content | p. 7 |
1.2.1 Questions that linguistics should answer | p. 8 |
1.2.2 Non-categorical phenomena in language | p. 11 |
1.2.3 Language and cognition as probabilistic phenomena | p. 15 |
1.3 The Ambiguity of Language: Why NLP Is Difficult | p. 17 |
1.4 Dirty Hands | p. 19 |
1.4.1 Lexical resources | p. 19 |
1.4.2 Word counts | p. 20 |
1.4.3 Zipf's laws | p. 23 |
1.4.4 Collocations | p. 29 |
1.4.5 Concordances | p. 31 |
1.5 Further Reading | p. 34 |
1.6 Exercises | p. 35 |
2 Mathematical Foundations | p. 39 |
2.1 Elementary Probability Theory | p. 40 |
2.1.1 Probability spaces | p. 40 |
2.1.2 Conditional probability and independence | p. 42 |
2.1.3 Bayes' theorem | p. 43 |
2.1.4 Random variables | p. 45 |
2.1.5 Expectation and variance | p. 46 |
2.1.6 Notation | p. 47 |
2.1.7 Joint and conditional distributions | p. 48 |
2.1.8 Determining P | p. 48 |
2.1.9 Standard distributions | p. 50 |
2.1.10 Bayesian statistics | p. 54 |
2.1.11 Exercises | p. 59 |
2.2 Essential Information Theory | p. 60 |
2.2.1 Entropy | p. 61 |
2.2.2 Joint entropy and conditional entropy | p. 63 |
2.2.3 Mutual information | p. 66 |
2.2.4 The noisy channel model | p. 68 |
2.2.5 Relative entropy or Kullback-Leibler divergence | p. 72 |
2.2.6 The relation to language: Cross entropy | p. 73 |
2.2.7 The entropy of English | p. 76 |
2.2.8 Perplexity | p. 78 |
2.2.9 Exercises | p. 78 |
2.3 Further Reading | p. 79 |
3 Linguistic Essentials | p. 81 |
3.1 Parts of Speech and Morphology | p. 81 |
3.1.1 Nouns and pronouns | p. 83 |
3.1.2 Words that accompany nouns: Determiners and adjectives | p. 87 |
3.1.3 Verbs | p. 88 |
3.1.4 Other parts of speech | p. 91 |
3.2 Phrase Structure | p. 93 |
3.2.1 Phrase structure grammars | p. 96 |
3.2.2 Dependency: Arguments and adjuncts | p. 101 |
3.2.3 X' theory | p. 106 |
3.2.4 Phrase structure ambiguity | p. 107 |
3.3 Semantics and Pragmatics | p. 109 |
3.4 Other Areas | p. 112 |
3.5 Further Reading | p. 113 |
3.6 Exercises | p. 114 |
4 Corpus-Based Work | p. 117 |
4.1 Getting Set Up | p. 118 |
4.1.1 Computers | p. 118 |
4.1.2 Corpora | p. 118 |
4.1.3 Software | p. 120 |
4.2 Looking at Text | p. 123 |
4.2.1 Low-level formatting issues | p. 123 |
4.2.2 Tokenization: What is a word? | p. 124 |
4.2.3 Morphology | p. 131 |
4.2.4 Sentences | p. 134 |
4.3 Marked-up Data | p. 136 |
4.3.1 Markup schemes | p. 137 |
4.3.2 Grammatical tagging | p. 139 |
4.4 Further Reading | p. 145 |
4.5 Exercises | p. 147 |
II Words | p. 149 |
5 Collocations | p. 151 |
5.1 Frequency | p. 153 |
5.2 Mean and Variance | p. 157 |
5.3 Hypothesis Testing | p. 162 |
5.3.1 The t test | p. 163 |
5.3.2 Hypothesis testing of differences | p. 166 |
5.3.3 Pearson's chi-square test | p. 169 |
5.3.4 Likelihood ratios | p. 172 |
5.4 Mutual Information | p. 178 |
5.5 The Notion of Collocation | p. 183 |
5.6 Further Reading | p. 187 |
6 Statistical Inference: n-gram Models over Sparse Data | p. 191 |
6.1 Bins: Forming Equivalence Classes | p. 192 |
6.1.1 Reliability vs. discrimination | p. 192 |
6.1.2 n-gram models | p. 192 |
6.1.3 Building n-gram models | p. 195 |
6.2 Statistical Estimators | p. 196 |
6.2.1 Maximum Likelihood Estimation (MLE) | p. 197 |
6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law | p. 202 |
6.2.3 Held out estimation | p. 205 |
6.2.4 Cross-validation (deleted estimation) | p. 210 |
6.2.5 Good-Turing estimation | p. 212 |
6.2.6 Briefly noted | p. 216 |
6.3 Combining Estimators | p. 217 |
6.3.1 Simple linear interpolation | p. 218 |
6.3.2 Katz's backing-off | p. 219 |
6.3.3 General linear interpolation | p. 220 |
6.3.4 Briefly noted | p. 222 |
6.3.5 Language models for Austen | p. 223 |
6.4 Conclusions | p. 224 |
6.5 Further Reading | p. 225 |
6.6 Exercises | p. 225 |
7 Word Sense Disambiguation | p. 229 |
7.1 Methodological Preliminaries | p. 232 |
7.1.1 Supervised and unsupervised learning | p. 232 |
7.1.2 Pseudowords | p. 233 |
7.1.3 Upper and lower bounds on performance | p. 233 |
7.2 Supervised Disambiguation | p. 235 |
7.2.1 Bayesian classification | p. 235 |
7.2.2 An information-theoretic approach | p. 239 |
7.3 Dictionary-Based Disambiguation | p. 241 |
7.3.1 Disambiguation based on sense definitions | p. 242 |
7.3.2 Thesaurus-based disambiguation | p. 244 |
7.3.3 Disambiguation based on translations in a second-language corpus | p. 247 |
7.3.4 One sense per discourse, one sense per collocation | p. 249 |
7.4 Unsupervised Disambiguation | p. 252 |
7.5 What Is a Word Sense? | p. 256 |
7.6 Further Reading | p. 260 |
7.7 Exercises | p. 262 |
8 Lexical Acquisition | p. 265 |
8.1 Evaluation Measures | p. 267 |
8.2 Verb Subcategorization | p. 271 |
8.3 Attachment Ambiguity | p. 278 |
8.3.1 Hindle and Rooth (1993) | p. 280 |
8.3.2 General remarks on PP attachment | p. 284 |
8.4 Selectional Preferences | p. 288 |
8.5 Semantic Similarity | p. 294 |
8.5.1 Vector space measures | p. 296 |
8.5.2 Probabilistic measures | p. 303 |
8.6 The Role of Lexical Acquisition in Statistical NLP | p. 308 |
8.7 Further Reading | p. 312 |
III Grammar | p. 315 |
9 Markov Models | p. 317 |
9.1 Markov Models | p. 318 |
9.2 Hidden Markov Models | p. 320 |
9.2.1 Why use HMMs? | p. 322 |
9.2.2 General form of an HMM | p. 324 |
9.3 The Three Fundamental Questions for HMMs | p. 325 |
9.3.1 Finding the probability of an observation | p. 326 |
9.3.2 Finding the best state sequence | p. 331 |
9.3.3 The third problem: Parameter estimation | p. 333 |
9.4 HMMs: Implementation, Properties, and Variants | p. 336 |
9.4.1 Implementation | p. 336 |
9.4.2 Variants | p. 337 |
9.4.3 Multiple input observations | p. 338 |
9.4.4 Initialization of parameter values | p. 339 |
9.5 Further Reading | p. 339 |
10 Part-of-Speech Tagging | p. 341 |
10.1 The Information Sources in Tagging | p. 343 |
10.2 Markov Model Taggers | p. 345 |
10.2.1 The probabilistic model | p. 345 |
10.2.2 The Viterbi algorithm | p. 349 |
10.2.3 Variations | p. 351 |
10.3 Hidden Markov Model Taggers | p. 356 |
10.3.1 Applying HMMs to POS tagging | p. 357 |
10.3.2 The effect of initialization on HMM training | p. 359 |
10.4 Transformation-Based Learning of Tags | p. 361 |
10.4.1 Transformations | p. 362 |
10.4.2 The learning algorithm | p. 364 |
10.4.3 Relation to other models | p. 365 |
10.4.4 Automata | p. 367 |
10.4.5 Summary | p. 369 |
10.5 Other Methods, Other Languages | p. 370 |
10.5.1 Other approaches to tagging | p. 370 |
10.5.2 Languages other than English | p. 371 |
10.6 Tagging Accuracy and Uses of Taggers | p. 371 |
10.6.1 Tagging accuracy | p. 371 |
10.6.2 Applications of tagging | p. 374 |
10.7 Further Reading | p. 377 |
10.8 Exercises | p. 379 |
11 Probabilistic Context Free Grammars | p. 381 |
11.1 Some Features of PCFGs | p. 386 |
11.2 Questions for PCFGs | p. 388 |
11.3 The Probability of a String | p. 392 |
11.3.1 Using inside probabilities | p. 392 |
11.3.2 Using outside probabilities | p. 394 |
11.3.3 Finding the most likely parse for a sentence | p. 396 |
11.3.4 Training a PCFG | p. 398 |
11.4 Problems with the Inside-Outside Algorithm | p. 401 |
11.5 Further Reading | p. 402 |
11.6 Exercises | p. 404 |
12 Probabilistic Parsing | p. 407 |
12.1 Some Concepts | p. 408 |
12.1.1 Parsing for disambiguation | p. 408 |
12.1.2 Treebanks | p. 412 |
12.1.3 Parsing models vs. language models | p. 414 |
12.1.4 Weakening the independence assumptions of PCFGs | p. 416 |
12.1.5 Tree probabilities and derivational probabilities | p. 421 |
12.1.6 There's more than one way to do it | p. 423 |
12.1.7 Phrase structure grammars and dependency grammars | p. 428 |
12.1.8 Evaluation | p. 431 |
12.1.9 Equivalent models | p. 437 |
12.1.10 Building parsers: Search methods | p. 439 |
12.1.11 Use of the geometric mean | p. 442 |
12.2 Some Approaches | p. 443 |
12.2.1 Non-lexicalized treebank grammars | p. 443 |
12.2.2 Lexicalized models using derivational histories | p. 448 |
12.2.3 Dependency-based models | p. 451 |
12.2.4 Discussion | p. 454 |
12.3 Further Reading | p. 456 |
12.4 Exercises | p. 458 |
IV Applications and Techniques | p. 461 |
13 Statistical Alignment and Machine Translation | p. 463 |
13.1 Text Alignment | p. 466 |
13.1.1 Aligning sentences and paragraphs | p. 467 |
13.1.2 Length-based methods | p. 471 |
13.1.3 Offset alignment by signal processing techniques | p. 475 |
13.1.4 Lexical methods of sentence alignment | p. 478 |
13.1.5 Summary | p. 484 |
13.1.6 Exercises | p. 484 |
13.2 Word Alignment | p. 484 |
13.3 Statistical Machine Translation | p. 486 |
13.4 Further Reading | p. 492 |
14 Clustering | p. 495 |
14.1 Hierarchical Clustering | p. 500 |
14.1.1 Single-link and complete-link clustering | p. 503 |
14.1.2 Group-average agglomerative clustering | p. 507 |
14.1.3 An application: Improving a language model | p. 509 |
14.1.4 Top-down clustering | p. 512 |
14.2 Non-Hierarchical Clustering | p. 514 |
14.2.1 K-means | p. 515 |
14.2.2 The EM algorithm | p. 518 |
14.3 Further Reading | p. 527 |
14.4 Exercises | p. 528 |
15 Topics in Information Retrieval | p. 529 |
15.1 Some Background on Information Retrieval | p. 530 |
15.1.1 Common design features of IR systems | p. 532 |
15.1.2 Evaluation measures | p. 534 |
15.1.3 The probability ranking principle (PRP) | p. 538 |
15.2 The Vector Space Model | p. 539 |
15.2.1 Vector similarity | p. 540 |
15.2.2 Term weighting | p. 541 |
15.3 Term Distribution Models | p. 544 |
15.3.1 The Poisson distribution | p. 545 |
15.3.2 The two-Poisson model | p. 548 |
15.3.3 The K mixture | p. 549 |
15.3.4 Inverse document frequency | p. 551 |
15.3.5 Residual inverse document frequency | p. 553 |
15.3.6 Usage of term distribution models | p. 554 |
15.4 Latent Semantic Indexing | p. 554 |
15.4.1 Least-squares methods | p. 557 |
15.4.2 Singular Value Decomposition | p. 558 |
15.4.3 Latent Semantic Indexing in IR | p. 564 |
15.5 Discourse Segmentation | p. 566 |
15.5.1 TextTiling | p. 567 |
15.6 Further Reading | p. 570 |
15.7 Exercises | p. 573 |
16 Text Categorization | p. 575 |
16.1 Decision Trees | p. 578 |
16.2 Maximum Entropy Modeling | p. 589 |
16.2.1 Generalized iterative scaling | p. 591 |
16.2.2 Application to text categorization | p. 594 |
16.3 Perceptrons | p. 597 |
16.4 k Nearest Neighbor Classification | p. 604 |
16.5 Further Reading | p. 607 |
Tiny Statistical Tables | p. 609 |
Bibliography | p. 611 |
Index | p. 657 |