Item Barcode | Call Number | Material Type | Item Category 1 |
---|---|---|---|
30000010282977 | QA76.9.T48 M36 2008 | Open Access Book | Book |
30000010293616 | QA76.9.T48 M36 2008 | Open Access Book | Book |
Summary
Class-tested and coherent, this groundbreaking new textbook teaches web-era information retrieval from basic concepts, including web search and the related areas of text classification and text clustering. Written from a computer science perspective by three leading experts in the field, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured to make teaching more natural and effective. Although originally designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also interest researchers and professionals.
Author Notes
Christopher D. Manning is Associate Professor of Computer Science and Linguistics at Stanford University
Prabhakar Raghavan is Head of Yahoo! Research and a Consulting Professor of Computer Science at Stanford University
Hinrich Schütze is Chair of Theoretical Computational Linguistics at the Institute for Natural Language Processing, University of Stuttgart
Reviews
Choice Review
This recent book on information retrieval (IR) is a timely one: no serious IR work has been published in years, perhaps decades, even though one would expect a flood--as the fundamental theory behind search, it is a very hot topic today. Manning (Stanford Univ.), Raghavan (Yahoo! Inc.), and Schütze (Univ. of Stuttgart, Germany) cover most of the important issues in IR very well, and offer more than enough coverage for a one-semester introductory IR course. Somewhat lacking is any serious discussion of non-text IR (images, sound, video, multimedia in general). This is an important, up-and-coming topic, and one that has not been adequately addressed in the literature. On the one hand, this introductory volume may not be the appropriate vehicle for such an advanced topic. On the other hand, while this reviewer was teaching an introductory course, he still allocated some time to give an overview of what he thinks is a major future challenge. Thus, it would have been nice to have this subject addressed here. But overall, the book serves its intended purpose quite well, and this reviewer plans to use it again in his next IR course. Summing Up: Highly recommended. Professional and academic collections, upper-division undergraduate and above. H. Levkowitz, University of Massachusetts
Table of Contents
Table of Notation | p. xi |
Preface | p. xv |
1 Boolean retrieval | p. 1 |
1.1 An example information retrieval problem | p. 3 |
1.2 A first take at building an inverted index | p. 6 |
1.3 Processing Boolean queries | p. 9 |
1.4 The extended Boolean model versus ranked retrieval | p. 13 |
1.5 References and further reading | p. 16 |
2 The term vocabulary and postings lists | p. 18 |
2.1 Document delineation and character sequence decoding | p. 18 |
2.2 Determining the vocabulary of terms | p. 21 |
2.3 Faster postings list intersection via skip pointers | p. 33 |
2.4 Positional postings and phrase queries | p. 36 |
2.5 References and further reading | p. 43 |
3 Dictionaries and tolerant retrieval | p. 45 |
3.1 Search structures for dictionaries | p. 45 |
3.2 Wildcard queries | p. 48 |
3.3 Spelling correction | p. 52 |
3.4 Phonetic correction | p. 58 |
3.5 References and further reading | p. 59 |
4 Index construction | p. 61 |
4.1 Hardware basics | p. 62 |
4.2 Blocked sort-based indexing | p. 63 |
4.3 Single-pass in-memory indexing | p. 66 |
4.4 Distributed indexing | p. 68 |
4.5 Dynamic indexing | p. 71 |
4.6 Other types of indexes | p. 73 |
4.7 References and further reading | p. 76 |
5 Index compression | p. 78 |
5.1 Statistical properties of terms in information retrieval | p. 79 |
5.2 Dictionary compression | p. 82 |
5.3 Postings file compression | p. 87 |
5.4 References and further reading | p. 97 |
6 Scoring, term weighting, and the vector space model | p. 100 |
6.1 Parametric and zone indexes | p. 101 |
6.2 Term frequency and weighting | p. 107 |
6.3 The vector space model for scoring | p. 110 |
6.4 Variant tf-idf functions | p. 116 |
6.5 References and further reading | p. 122 |
7 Computing scores in a complete search system | p. 124 |
7.1 Efficient scoring and ranking | p. 124 |
7.2 Components of an information retrieval system | p. 132 |
7.3 Vector space scoring and query operator interaction | p. 136 |
7.4 References and further reading | p. 137 |
8 Evaluation in information retrieval | p. 139 |
8.1 Information retrieval system evaluation | p. 140 |
8.2 Standard test collections | p. 141 |
8.3 Evaluation of unranked retrieval sets | p. 142 |
8.4 Evaluation of ranked retrieval results | p. 145 |
8.5 Assessing relevance | p. 151 |
8.6 A broader perspective: System quality and user utility | p. 154 |
8.7 Results snippets | p. 157 |
8.8 References and further reading | p. 159 |
9 Relevance feedback and query expansion | p. 162 |
9.1 Relevance feedback and pseudo relevance feedback | p. 163 |
9.2 Global methods for query reformulation | p. 173 |
9.3 References and further reading | p. 177 |
10 XML retrieval | p. 178 |
10.1 Basic XML concepts | p. 180 |
10.2 Challenges in XML retrieval | p. 183 |
10.3 A vector space model for XML retrieval | p. 188 |
10.4 Evaluation of XML retrieval | p. 192 |
10.5 Text-centric versus data-centric XML retrieval | p. 196 |
10.6 References and further reading | p. 198 |
11 Probabilistic information retrieval | p. 201 |
11.1 Review of basic probability theory | p. 202 |
11.2 The probability ranking principle | p. 203 |
11.3 The binary independence model | p. 204 |
11.4 An appraisal and some extensions | p. 212 |
11.5 References and further reading | p. 216 |
12 Language models for information retrieval | p. 218 |
12.1 Language models | p. 218 |
12.2 The query likelihood model | p. 223 |
12.3 Language modeling versus other approaches in information retrieval | p. 229 |
12.4 Extended language modeling approaches | p. 230 |
12.5 References and further reading | p. 232 |
13 Text classification and Naive Bayes | p. 234 |
13.1 The text classification problem | p. 237 |
13.2 Naive Bayes text classification | p. 238 |
13.3 The Bernoulli model | p. 243 |
13.4 Properties of Naive Bayes | p. 245 |
13.5 Feature selection | p. 251 |
13.6 Evaluation of text classification | p. 258 |
13.7 References and further reading | p. 264 |
14 Vector space classification | p. 266 |
14.1 Document representations and measures of relatedness in vector spaces | p. 267 |
14.2 Rocchio classification | p. 269 |
14.3 k nearest neighbor | p. 273 |
14.4 Linear versus nonlinear classifiers | p. 277 |
14.5 Classification with more than two classes | p. 281 |
14.6 The bias-variance tradeoff | p. 284 |
14.7 References and further reading | p. 291 |
15 Support vector machines and machine learning on documents | p. 293 |
15.1 Support vector machines: The linearly separable case | p. 294 |
15.2 Extensions to the support vector machine model | p. 300 |
15.3 Issues in the classification of text documents | p. 307 |
15.4 Machine-learning methods in ad hoc information retrieval | p. 314 |
15.5 References and further reading | p. 318 |
16 Flat clustering | p. 321 |
16.1 Clustering in information retrieval | p. 322 |
16.2 Problem statement | p. 326 |
16.3 Evaluation of clustering | p. 327 |
16.4 K-means | p. 331 |
16.5 Model-based clustering | p. 338 |
16.6 References and further reading | p. 343 |
17 Hierarchical clustering | p. 346 |
17.1 Hierarchical agglomerative clustering | p. 347 |
17.2 Single-link and complete-link clustering | p. 350 |
17.3 Group-average agglomerative clustering | p. 356 |
17.4 Centroid clustering | p. 358 |
17.5 Optimality of hierarchical agglomerative clustering | p. 360 |
17.6 Divisive clustering | p. 362 |
17.7 Cluster labeling | p. 363 |
17.8 Implementation notes | p. 365 |
17.9 References and further reading | p. 367 |
18 Matrix decompositions and latent semantic indexing | p. 369 |
18.1 Linear algebra review | p. 369 |
18.2 Term-document matrices and singular value decompositions | p. 373 |
18.3 Low-rank approximations | p. 376 |
18.4 Latent semantic indexing | p. 378 |
18.5 References and further reading | p. 383 |
19 Web search basics | p. 385 |
19.1 Background and history | p. 385 |
19.2 Web characteristics | p. 387 |
19.3 Advertising as the economic model | p. 392 |
19.4 The search user experience | p. 395 |
19.5 Index size and estimation | p. 396 |
19.6 Near-duplicates and shingling | p. 400 |
19.7 References and further reading | p. 404 |
20 Web crawling and indexes | p. 405 |
20.1 Overview | p. 405 |
20.2 Crawling | p. 406 |
20.3 Distributing indexes | p. 415 |
20.4 Connectivity servers | p. 416 |
20.5 References and further reading | p. 419 |
21 Link analysis | p. 421 |
21.1 The Web as a graph | p. 422 |
21.2 PageRank | p. 424 |
21.3 Hubs and authorities | p. 433 |
21.4 References and further reading | p. 439 |
Bibliography | p. 441 |
Index | p. 469 |