Available:

Item Barcode | Call Number | Material Type | Item Category | Status |
---|---|---|---|---|
30000010319237 | P98.5.S83 M36 1999 | Open Access Book | Book | |
On Order
Summary
Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.
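As a flavor of the statistical methods the book covers, here is a minimal sketch (illustrative only, not taken from the book) of maximum-likelihood estimation for a bigram language model, the kind of n-gram estimator discussed in chapter 6:

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w2 | w1) by maximum likelihood: count(w1, w2) / count(w1).

    `tokens` is a plain list of word strings; no smoothing is applied,
    so unseen bigrams simply have no entry (probability zero).
    """
    # Count each word in a "history" position (all but the last token).
    unigrams = Counter(tokens[:-1])
    # Count adjacent word pairs.
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)
# "the" occurs twice as a history, once followed by "cat",
# so P(cat | the) = 1/2.
print(probs[("the", "cat")])  # → 0.5
```

The zero probability assigned to unseen bigrams is exactly the sparse-data problem that motivates the smoothing and combined estimators (Laplace's law, held-out estimation, Good-Turing, backing-off) treated in chapter 6.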
Author Notes
Christopher D. Manning is Lecturer in the Department of Linguistics at the University of Sydney, Australia. Hinrich Schütze is on the Research Staff at the Xerox Palo Alto Research Center.
Table of Contents
List of Tables | p. xv |
List of Figures | p. xxi |
Table of Notations | p. xxv |
Preface | p. xxix |
Road Map | p. xxxv |
I Preliminaries | p. 1 |
1 Introduction | p. 3 |
1.1 Rationalist and Empiricist Approaches to Language | p. 4 |
1.2 Scientific Content | p. 7 |
1.2.1 Questions that linguistics should answer | p. 8 |
1.2.2 Non-categorical phenomena in language | p. 11 |
1.2.3 Language and cognition as probabilistic phenomena | p. 15 |
1.3 The Ambiguity of Language: Why NLP Is Difficult | p. 17 |
1.4 Dirty Hands | p. 19 |
1.4.1 Lexical resources | p. 19 |
1.4.2 Word counts | p. 20 |
1.4.3 Zipf's laws | p. 23 |
1.4.4 Collocations | p. 29 |
1.4.5 Concordances | p. 31 |
1.5 Further Reading | p. 34 |
1.6 Exercises | p. 35 |
2 Mathematical Foundations | p. 39 |
2.1 Elementary Probability Theory | p. 40 |
2.1.1 Probability spaces | p. 40 |
2.1.2 Conditional probability and independence | p. 42 |
2.1.3 Bayes' theorem | p. 43 |
2.1.4 Random variables | p. 45 |
2.1.5 Expectation and variance | p. 46 |
2.1.6 Notation | p. 47 |
2.1.7 Joint and conditional distributions | p. 48 |
2.1.8 Determining P | p. 48 |
2.1.9 Standard distributions | p. 50 |
2.1.10 Bayesian statistics | p. 54 |
2.1.11 Exercises | p. 59 |
2.2 Essential Information Theory | p. 60 |
2.2.1 Entropy | p. 61 |
2.2.2 Joint entropy and conditional entropy | p. 63 |
2.2.3 Mutual information | p. 66 |
2.2.4 The noisy channel model | p. 68 |
2.2.5 Relative entropy or Kullback-Leibler divergence | p. 72 |
2.2.6 The relation to language: Cross entropy | p. 73 |
2.2.7 The entropy of English | p. 76 |
2.2.8 Perplexity | p. 78 |
2.2.9 Exercises | p. 78 |
2.3 Further Reading | p. 79 |
3 Linguistic Essentials | p. 81 |
3.1 Parts of Speech and Morphology | p. 81 |
3.1.1 Nouns and pronouns | p. 83 |
3.1.2 Words that accompany nouns: Determiners and adjectives | p. 87 |
3.1.3 Verbs | p. 88 |
3.1.4 Other parts of speech | p. 91 |
3.2 Phrase Structure | p. 93 |
3.2.1 Phrase structure grammars | p. 96 |
3.2.2 Dependency: Arguments and adjuncts | p. 101 |
3.2.3 X' theory | p. 106 |
3.2.4 Phrase structure ambiguity | p. 107 |
3.3 Semantics and Pragmatics | p. 109 |
3.4 Other Areas | p. 112 |
3.5 Further Reading | p. 113 |
3.6 Exercises | p. 114 |
4 Corpus-Based Work | p. 117 |
4.1 Getting Set Up | p. 118 |
4.1.1 Computers | p. 118 |
4.1.2 Corpora | p. 118 |
4.1.3 Software | p. 120 |
4.2 Looking at Text | p. 123 |
4.2.1 Low-level formatting issues | p. 123 |
4.2.2 Tokenization: What is a word? | p. 124 |
4.2.3 Morphology | p. 131 |
4.2.4 Sentences | p. 134 |
4.3 Marked-up Data | p. 136 |
4.3.1 Markup schemes | p. 137 |
4.3.2 Grammatical tagging | p. 139 |
4.4 Further Reading | p. 145 |
4.5 Exercises | p. 147 |
II Words | p. 149 |
5 Collocations | p. 151 |
5.1 Frequency | p. 153 |
5.2 Mean and Variance | p. 157 |
5.3 Hypothesis Testing | p. 162 |
5.3.1 The t test | p. 163 |
5.3.2 Hypothesis testing of differences | p. 166 |
5.3.3 Pearson's chi-square test | p. 169 |
5.3.4 Likelihood ratios | p. 172 |
5.4 Mutual Information | p. 178 |
5.5 The Notion of Collocation | p. 183 |
5.6 Further Reading | p. 187 |
6 Statistical Inference: n-gram Models over Sparse Data | p. 191 |
6.1 Bins: Forming Equivalence Classes | p. 192 |
6.1.1 Reliability vs. discrimination | p. 192 |
6.1.2 n-gram models | p. 192 |
6.1.3 Building n-gram models | p. 195 |
6.2 Statistical Estimators | p. 196 |
6.2.1 Maximum Likelihood Estimation (MLE) | p. 197 |
6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law | p. 202 |
6.2.3 Held out estimation | p. 205 |
6.2.4 Cross-validation (deleted estimation) | p. 210 |
6.2.5 Good-Turing estimation | p. 212 |
6.2.6 Briefly noted | p. 216 |
6.3 Combining Estimators | p. 217 |
6.3.1 Simple linear interpolation | p. 218 |
6.3.2 Katz's backing-off | p. 219 |
6.3.3 General linear interpolation | p. 220 |
6.3.4 Briefly noted | p. 222 |
6.3.5 Language models for Austen | p. 223 |
6.4 Conclusions | p. 224 |
6.5 Further Reading | p. 225 |
6.6 Exercises | p. 225 |
7 Word Sense Disambiguation | p. 229 |
7.1 Methodological Preliminaries | p. 232 |
7.1.1 Supervised and unsupervised learning | p. 232 |
7.1.2 Pseudowords | p. 233 |
7.1.3 Upper and lower bounds on performance | p. 233 |
7.2 Supervised Disambiguation | p. 235 |
7.2.1 Bayesian classification | p. 235 |
7.2.2 An information-theoretic approach | p. 239 |
7.3 Dictionary-Based Disambiguation | p. 241 |
7.3.1 Disambiguation based on sense definitions | p. 242 |
7.3.2 Thesaurus-based disambiguation | p. 244 |
7.3.3 Disambiguation based on translations in a second-language corpus | p. 247 |
7.3.4 One sense per discourse, one sense per collocation | p. 249 |
7.4 Unsupervised Disambiguation | p. 252 |
7.5 What Is a Word Sense? | p. 256 |
7.6 Further Reading | p. 260 |
7.7 Exercises | p. 262 |
8 Lexical Acquisition | p. 265 |
8.1 Evaluation Measures | p. 267 |
8.2 Verb Subcategorization | p. 271 |
8.3 Attachment Ambiguity | p. 278 |
8.3.1 Hindle and Rooth (1993) | p. 280 |
8.3.2 General remarks on PP attachment | p. 284 |
8.4 Selectional Preferences | p. 288 |
8.5 Semantic Similarity | p. 294 |
8.5.1 Vector space measures | p. 296 |
8.5.2 Probabilistic measures | p. 303 |
8.6 The Role of Lexical Acquisition in Statistical NLP | p. 308 |
8.7 Further Reading | p. 312 |
III Grammar | p. 315 |
9 Markov Models | p. 317 |
9.1 Markov Models | p. 318 |
9.2 Hidden Markov Models | p. 320 |
9.2.1 Why use HMMs? | p. 322 |
9.2.2 General form of an HMM | p. 324 |
9.3 The Three Fundamental Questions for HMMs | p. 325 |
9.3.1 Finding the probability of an observation | p. 326 |
9.3.2 Finding the best state sequence | p. 331 |
9.3.3 The third problem: Parameter estimation | p. 333 |
9.4 HMMs: Implementation, Properties, and Variants | p. 336 |
9.4.1 Implementation | p. 336 |
9.4.2 Variants | p. 337 |
9.4.3 Multiple input observations | p. 338 |
9.4.4 Initialization of parameter values | p. 339 |
9.5 Further Reading | p. 339 |
10 Part-of-Speech Tagging | p. 341 |
10.1 The Information Sources in Tagging | p. 343 |
10.2 Markov Model Taggers | p. 345 |
10.2.1 The probabilistic model | p. 345 |
10.2.2 The Viterbi algorithm | p. 349 |
10.2.3 Variations | p. 351 |
10.3 Hidden Markov Model Taggers | p. 356 |
10.3.1 Applying HMMs to POS tagging | p. 357 |
10.3.2 The effect of initialization on HMM training | p. 359 |
10.4 Transformation-Based Learning of Tags | p. 361 |
10.4.1 Transformations | p. 362 |
10.4.2 The learning algorithm | p. 364 |
10.4.3 Relation to other models | p. 365 |
10.4.4 Automata | p. 367 |
10.4.5 Summary | p. 369 |
10.5 Other Methods, Other Languages | p. 370 |
10.5.1 Other approaches to tagging | p. 370 |
10.5.2 Languages other than English | p. 371 |
10.6 Tagging Accuracy and Uses of Taggers | p. 371 |
10.6.1 Tagging accuracy | p. 371 |
10.6.2 Applications of tagging | p. 374 |
10.7 Further Reading | p. 377 |
10.8 Exercises | p. 379 |
11 Probabilistic Context Free Grammars | p. 381 |
11.1 Some Features of PCFGs | p. 386 |
11.2 Questions for PCFGs | p. 388 |
11.3 The Probability of a String | p. 392 |
11.3.1 Using inside probabilities | p. 392 |
11.3.2 Using outside probabilities | p. 394 |
11.3.3 Finding the most likely parse for a sentence | p. 396 |
11.3.4 Training a PCFG | p. 398 |
11.4 Problems with the Inside-Outside Algorithm | p. 401 |
11.5 Further Reading | p. 402 |
11.6 Exercises | p. 404 |
12 Probabilistic Parsing | p. 407 |
12.1 Some Concepts | p. 408 |
12.1.1 Parsing for disambiguation | p. 408 |
12.1.2 Treebanks | p. 412 |
12.1.3 Parsing models vs. language models | p. 414 |
12.1.4 Weakening the independence assumptions of PCFGs | p. 416 |
12.1.5 Tree probabilities and derivational probabilities | p. 421 |
12.1.6 There's more than one way to do it | p. 423 |
12.1.7 Phrase structure grammars and dependency grammars | p. 428 |
12.1.8 Evaluation | p. 431 |
12.1.9 Equivalent models | p. 437 |
12.1.10 Building parsers: Search methods | p. 439 |
12.1.11 Use of the geometric mean | p. 442 |
12.2 Some Approaches | p. 443 |
12.2.1 Non-lexicalized treebank grammars | p. 443 |
12.2.2 Lexicalized models using derivational histories | p. 448 |
12.2.3 Dependency-based models | p. 451 |
12.2.4 Discussion | p. 454 |
12.3 Further Reading | p. 456 |
12.4 Exercises | p. 458 |
IV Applications and Techniques | p. 461 |
13 Statistical Alignment and Machine Translation | p. 463 |
13.1 Text Alignment | p. 466 |
13.1.1 Aligning sentences and paragraphs | p. 467 |
13.1.2 Length-based methods | p. 471 |
13.1.3 Offset alignment by signal processing techniques | p. 475 |
13.1.4 Lexical methods of sentence alignment | p. 478 |
13.1.5 Summary | p. 484 |
13.1.6 Exercises | p. 484 |
13.2 Word Alignment | p. 484 |
13.3 Statistical Machine Translation | p. 486 |
13.4 Further Reading | p. 492 |
14 Clustering | p. 495 |
14.1 Hierarchical Clustering | p. 500 |
14.1.1 Single-link and complete-link clustering | p. 503 |
14.1.2 Group-average agglomerative clustering | p. 507 |
14.1.3 An application: Improving a language model | p. 509 |
14.1.4 Top-down clustering | p. 512 |
14.2 Non-Hierarchical Clustering | p. 514 |
14.2.1 K-means | p. 515 |
14.2.2 The EM algorithm | p. 518 |
14.3 Further Reading | p. 527 |
14.4 Exercises | p. 528 |
15 Topics in Information Retrieval | p. 529 |
15.1 Some Background on Information Retrieval | p. 530 |
15.1.1 Common design features of IR systems | p. 532 |
15.1.2 Evaluation measures | p. 534 |
15.1.3 The probability ranking principle (PRP) | p. 538 |
15.2 The Vector Space Model | p. 539 |
15.2.1 Vector similarity | p. 540 |
15.2.2 Term weighting | p. 541 |
15.3 Term Distribution Models | p. 544 |
15.3.1 The Poisson distribution | p. 545 |
15.3.2 The two-Poisson model | p. 548 |
15.3.3 The K mixture | p. 549 |
15.3.4 Inverse document frequency | p. 551 |
15.3.5 Residual inverse document frequency | p. 553 |
15.3.6 Usage of term distribution models | p. 554 |
15.4 Latent Semantic Indexing | p. 554 |
15.4.1 Least-squares methods | p. 557 |
15.4.2 Singular Value Decomposition | p. 558 |
15.4.3 Latent Semantic Indexing in IR | p. 564 |
15.5 Discourse Segmentation | p. 566 |
15.5.1 TextTiling | p. 567 |
15.6 Further Reading | p. 570 |
15.7 Exercises | p. 573 |
16 Text Categorization | p. 575 |
16.1 Decision Trees | p. 578 |
16.2 Maximum Entropy Modeling | p. 589 |
16.2.1 Generalized iterative scaling | p. 591 |
16.2.2 Application to text categorization | p. 594 |
16.3 Perceptrons | p. 597 |
16.4 k Nearest Neighbor Classification | p. 604 |
16.5 Further Reading | p. 607 |
Tiny Statistical Tables | p. 609 |
Bibliography | p. 611 |
Index | p. 657 |