Title:
Foundations of statistical natural language processing
Personal Author:
Manning, Christopher D.
Publication Information:
Cambridge, Mass. : MIT Press, c1999
Physical Description:
xxxvii, 680 p. ; 24 cm.
ISBN:
9780262133609
Added Author:
Schütze, Hinrich

Available:

Item Barcode: 30000010319237
Call Number: P98.5.S83 M36 1999
Material Type: Open Access Book
Item Category 1: Book

Summary

Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.


Author Notes

Christopher D. Manning is Lecturer in the Department of Linguistics at the University of Sydney, Australia. Hinrich Schütze is on the Research Staff at the Xerox Palo Alto Research Center.


Table of Contents

List of Tables p. xv
List of Figures p. xxi
Table of Notations p. xxv
Preface p. xxix
Road Map p. xxxv
I Preliminaries p. 1
1 Introduction p. 3
1.1 Rationalist and Empiricist Approaches to Language p. 4
1.2 Scientific Content p. 7
1.2.1 Questions that linguistics should answer p. 8
1.2.2 Non-categorical phenomena in language p. 11
1.2.3 Language and cognition as probabilistic phenomena p. 15
1.3 The Ambiguity of Language: Why NLP Is Difficult p. 17
1.4 Dirty Hands p. 19
1.4.1 Lexical resources p. 19
1.4.2 Word counts p. 20
1.4.3 Zipf's laws p. 23
1.4.4 Collocations p. 29
1.4.5 Concordances p. 31
1.5 Further Reading p. 34
1.6 Exercises p. 35
2 Mathematical Foundations p. 39
2.1 Elementary Probability Theory p. 40
2.1.1 Probability spaces p. 40
2.1.2 Conditional probability and independence p. 42
2.1.3 Bayes' theorem p. 43
2.1.4 Random variables p. 45
2.1.5 Expectation and variance p. 46
2.1.6 Notation p. 47
2.1.7 Joint and conditional distributions p. 48
2.1.8 Determining P p. 48
2.1.9 Standard distributions p. 50
2.1.10 Bayesian statistics p. 54
2.1.11 Exercises p. 59
2.2 Essential Information Theory p. 60
2.2.1 Entropy p. 61
2.2.2 Joint entropy and conditional entropy p. 63
2.2.3 Mutual information p. 66
2.2.4 The noisy channel model p. 68
2.2.5 Relative entropy or Kullback-Leibler divergence p. 72
2.2.6 The relation to language: Cross entropy p. 73
2.2.7 The entropy of English p. 76
2.2.8 Perplexity p. 78
2.2.9 Exercises p. 78
2.3 Further Reading p. 79
3 Linguistic Essentials p. 81
3.1 Parts of Speech and Morphology p. 81
3.1.1 Nouns and pronouns p. 83
3.1.2 Words that accompany nouns: Determiners and adjectives p. 87
3.1.3 Verbs p. 88
3.1.4 Other parts of speech p. 91
3.2 Phrase Structure p. 93
3.2.1 Phrase structure grammars p. 96
3.2.2 Dependency: Arguments and adjuncts p. 101
3.2.3 X' theory p. 106
3.2.4 Phrase structure ambiguity p. 107
3.3 Semantics and Pragmatics p. 109
3.4 Other Areas p. 112
3.5 Further Reading p. 113
3.6 Exercises p. 114
4 Corpus-Based Work p. 117
4.1 Getting Set Up p. 118
4.1.1 Computers p. 118
4.1.2 Corpora p. 118
4.1.3 Software p. 120
4.2 Looking at Text p. 123
4.2.1 Low-level formatting issues p. 123
4.2.2 Tokenization: What is a word? p. 124
4.2.3 Morphology p. 131
4.2.4 Sentences p. 134
4.3 Marked-up Data p. 136
4.3.1 Markup schemes p. 137
4.3.2 Grammatical tagging p. 139
4.4 Further Reading p. 145
4.5 Exercises p. 147
II Words p. 149
5 Collocations p. 151
5.1 Frequency p. 153
5.2 Mean and Variance p. 157
5.3 Hypothesis Testing p. 162
5.3.1 The t test p. 163
5.3.2 Hypothesis testing of differences p. 166
5.3.3 Pearson's chi-square test p. 169
5.3.4 Likelihood ratios p. 172
5.4 Mutual Information p. 178
5.5 The Notion of Collocation p. 183
5.6 Further Reading p. 187
6 Statistical Inference: n-gram Models over Sparse Data p. 191
6.1 Bins: Forming Equivalence Classes p. 192
6.1.1 Reliability vs. discrimination p. 192
6.1.2 n-gram models p. 192
6.1.3 Building n-gram models p. 195
6.2 Statistical Estimators p. 196
6.2.1 Maximum Likelihood Estimation (MLE) p. 197
6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law p. 202
6.2.3 Held out estimation p. 205
6.2.4 Cross-validation (deleted estimation) p. 210
6.2.5 Good-Turing estimation p. 212
6.2.6 Briefly noted p. 216
6.3 Combining Estimators p. 217
6.3.1 Simple linear interpolation p. 218
6.3.2 Katz's backing-off p. 219
6.3.3 General linear interpolation p. 220
6.3.4 Briefly noted p. 222
6.3.5 Language models for Austen p. 223
6.4 Conclusions p. 224
6.5 Further Reading p. 225
6.6 Exercises p. 225
7 Word Sense Disambiguation p. 229
7.1 Methodological Preliminaries p. 232
7.1.1 Supervised and unsupervised learning p. 232
7.1.2 Pseudowords p. 233
7.1.3 Upper and lower bounds on performance p. 233
7.2 Supervised Disambiguation p. 235
7.2.1 Bayesian classification p. 235
7.2.2 An information-theoretic approach p. 239
7.3 Dictionary-Based Disambiguation p. 241
7.3.1 Disambiguation based on sense definitions p. 242
7.3.2 Thesaurus-based disambiguation p. 244
7.3.3 Disambiguation based on translations in a second-language corpus p. 247
7.3.4 One sense per discourse, one sense per collocation p. 249
7.4 Unsupervised Disambiguation p. 252
7.5 What Is a Word Sense? p. 256
7.6 Further Reading p. 260
7.7 Exercises p. 262
8 Lexical Acquisition p. 265
8.1 Evaluation Measures p. 267
8.2 Verb Subcategorization p. 271
8.3 Attachment Ambiguity p. 278
8.3.1 Hindle and Rooth (1993) p. 280
8.3.2 General remarks on PP attachment p. 284
8.4 Selectional Preferences p. 288
8.5 Semantic Similarity p. 294
8.5.1 Vector space measures p. 296
8.5.2 Probabilistic measures p. 303
8.6 The Role of Lexical Acquisition in Statistical NLP p. 308
8.7 Further Reading p. 312
III Grammar p. 315
9 Markov Models p. 317
9.1 Markov Models p. 318
9.2 Hidden Markov Models p. 320
9.2.1 Why use HMMs? p. 322
9.2.2 General form of an HMM p. 324
9.3 The Three Fundamental Questions for HMMs p. 325
9.3.1 Finding the probability of an observation p. 326
9.3.2 Finding the best state sequence p. 331
9.3.3 The third problem: Parameter estimation p. 333
9.4 HMMs: Implementation, Properties, and Variants p. 336
9.4.1 Implementation p. 336
9.4.2 Variants p. 337
9.4.3 Multiple input observations p. 338
9.4.4 Initialization of parameter values p. 339
9.5 Further Reading p. 339
10 Part-of-Speech Tagging p. 341
10.1 The Information Sources in Tagging p. 343
10.2 Markov Model Taggers p. 345
10.2.1 The probabilistic model p. 345
10.2.2 The Viterbi algorithm p. 349
10.2.3 Variations p. 351
10.3 Hidden Markov Model Taggers p. 356
10.3.1 Applying HMMs to POS tagging p. 357
10.3.2 The effect of initialization on HMM training p. 359
10.4 Transformation-Based Learning of Tags p. 361
10.4.1 Transformations p. 362
10.4.2 The learning algorithm p. 364
10.4.3 Relation to other models p. 365
10.4.4 Automata p. 367
10.4.5 Summary p. 369
10.5 Other Methods, Other Languages p. 370
10.5.1 Other approaches to tagging p. 370
10.5.2 Languages other than English p. 371
10.6 Tagging Accuracy and Uses of Taggers p. 371
10.6.1 Tagging accuracy p. 371
10.6.2 Applications of tagging p. 374
10.7 Further Reading p. 377
10.8 Exercises p. 379
11 Probabilistic Context Free Grammars p. 381
11.1 Some Features of PCFGs p. 386
11.2 Questions for PCFGs p. 388
11.3 The Probability of a String p. 392
11.3.1 Using inside probabilities p. 392
11.3.2 Using outside probabilities p. 394
11.3.3 Finding the most likely parse for a sentence p. 396
11.3.4 Training a PCFG p. 398
11.4 Problems with the Inside-Outside Algorithm p. 401
11.5 Further Reading p. 402
11.6 Exercises p. 404
12 Probabilistic Parsing p. 407
12.1 Some Concepts p. 408
12.1.1 Parsing for disambiguation p. 408
12.1.2 Treebanks p. 412
12.1.3 Parsing models vs. language models p. 414
12.1.4 Weakening the independence assumptions of PCFGs p. 416
12.1.5 Tree probabilities and derivational probabilities p. 421
12.1.6 There's more than one way to do it p. 423
12.1.7 Phrase structure grammars and dependency grammars p. 428
12.1.8 Evaluation p. 431
12.1.9 Equivalent models p. 437
12.1.10 Building parsers: Search methods p. 439
12.1.11 Use of the geometric mean p. 442
12.2 Some Approaches p. 443
12.2.1 Non-lexicalized treebank grammars p. 443
12.2.2 Lexicalized models using derivational histories p. 448
12.2.3 Dependency-based models p. 451
12.2.4 Discussion p. 454
12.3 Further Reading p. 456
12.4 Exercises p. 458
IV Applications and Techniques p. 461
13 Statistical Alignment and Machine Translation p. 463
13.1 Text Alignment p. 466
13.1.1 Aligning sentences and paragraphs p. 467
13.1.2 Length-based methods p. 471
13.1.3 Offset alignment by signal processing techniques p. 475
13.1.4 Lexical methods of sentence alignment p. 478
13.1.5 Summary p. 484
13.1.6 Exercises p. 484
13.2 Word Alignment p. 484
13.3 Statistical Machine Translation p. 486
13.4 Further Reading p. 492
14 Clustering p. 495
14.1 Hierarchical Clustering p. 500
14.1.1 Single-link and complete-link clustering p. 503
14.1.2 Group-average agglomerative clustering p. 507
14.1.3 An application: Improving a language model p. 509
14.1.4 Top-down clustering p. 512
14.2 Non-Hierarchical Clustering p. 514
14.2.1 K-means p. 515
14.2.2 The EM algorithm p. 518
14.3 Further Reading p. 527
14.4 Exercises p. 528
15 Topics in Information Retrieval p. 529
15.1 Some Background on Information Retrieval p. 530
15.1.1 Common design features of IR systems p. 532
15.1.2 Evaluation measures p. 534
15.1.3 The probability ranking principle (PRP) p. 538
15.2 The Vector Space Model p. 539
15.2.1 Vector similarity p. 540
15.2.2 Term weighting p. 541
15.3 Term Distribution Models p. 544
15.3.1 The Poisson distribution p. 545
15.3.2 The two-Poisson model p. 548
15.3.3 The K mixture p. 549
15.3.4 Inverse document frequency p. 551
15.3.5 Residual inverse document frequency p. 553
15.3.6 Usage of term distribution models p. 554
15.4 Latent Semantic Indexing p. 554
15.4.1 Least-squares methods p. 557
15.4.2 Singular Value Decomposition p. 558
15.4.3 Latent Semantic Indexing in IR p. 564
15.5 Discourse Segmentation p. 566
15.5.1 TextTiling p. 567
15.6 Further Reading p. 570
15.7 Exercises p. 573
16 Text Categorization p. 575
16.1 Decision Trees p. 578
16.2 Maximum Entropy Modeling p. 589
16.2.1 Generalized iterative scaling p. 591
16.2.2 Application to text categorization p. 594
16.3 Perceptrons p. 597
16.4 k Nearest Neighbor Classification p. 604
16.5 Further Reading p. 607
Tiny Statistical Tables p. 609
Bibliography p. 611
Index p. 657