Summary
This book presents statistical models that have recently been developed within several research communities to access information contained in text collections. The problems considered are linked to applications aiming at facilitating information access:
- information extraction and retrieval;
- text classification and clustering;
- opinion mining;
- comprehension aids (automatic summarization, machine translation, visualization).
To give the reader as complete a description as possible, the focus is placed on the probability models used in the applications concerned, highlighting the relationship between models and applications and illustrating the behavior of each model on real collections.
Textual Information Access is organized around four themes: information retrieval and ranking models, classification and clustering (logistic regression, kernel methods, Markov fields, etc.), multilingualism and machine translation, and emerging applications such as information exploration.
Contents
Part 1: Information Retrieval
1. Probabilistic Models for Information Retrieval, Stéphane Clinchant and Eric Gaussier.
2. Learnable Ranking Models for Automatic Text Summarization and Information Retrieval, Massih-Réza Amini, David Buffoni, Patrick Gallinari, Tuong Vinh Truong and Nicolas Usunier.
Part 2: Classification and Clustering
3. Logistic Regression and Text Classification, Sujeevan Aseervatham, Eric Gaussier, Anestis Antoniadis, Michel Burlet and Yves Denneulin.
4. Kernel Methods for Textual Information Access, Jean-Michel Renders.
5. Topic-Based Generative Models for Text Information Access, Jean-Cédric Chappelier.
6. Conditional Random Fields for Information Extraction, Isabelle Tellier and Marc Tommasi.
Part 3: Multilingualism
7. Statistical Methods for Machine Translation, Alexandre Allauzen and François Yvon.
Part 4: Emerging Applications
8. Information Mining: Methods and Interfaces for Accessing Complex Information, Josiane Mothe, Kurt Englmeier and Fionn Murtagh.
9. Opinion Detection as a Topic Classification Problem, Juan-Manuel Torres-Moreno, Marc El-Bèze, Patrice Bellot and Frédéric Béchet.
Author Notes
Eric Gaussier has been Professor of Computer Science at Joseph Fourier University in France since September 2006. He currently leads the AMA team, whose research fits within the general framework of machine learning and information modeling. Since 2010, he has also been deputy director of the Grenoble Informatics Laboratory, one of the largest Computer Science laboratories in France.
François Yvon is Professor of Computer Science at the University of Paris Sud in Orsay, France and a member of the Spoken Language Processing group of LIMSI/CNRS. His main research interests include analogy-based and statistical language learning, speech recognition and synthesis, and machine translation. He is currently leading LIMSI's research activities on statistical machine translation.
Table of Contents
Introduction | p. xiii |
Part 1 Information Retrieval | p. 1 |
Chapter 1 Probabilistic Models for Information Retrieval | p. 3 |
1.1 Introduction | p. 3 |
1.1.1 Heuristic retrieval constraints | p. 6 |
1.2 2-Poisson models | p. 8 |
1.3 Probability ranking principle (PRP) | p. 10 |
1.3.1 Reformulation | p. 12 |
1.3.2 BM25 | p. 13 |
1.4 Language models | p. 15 |
1.4.1 Smoothing methods | p. 16 |
1.4.2 The Kullback-Leibler model | p. 19 |
1.4.3 Noisy channel model | p. 20 |
1.4.4 Some remarks | p. 20 |
1.5 Informational approaches | p. 21 |
1.5.1 DFR models | p. 22 |
1.5.2 Information-based models | p. 25 |
1.6 Experimental comparison | p. 27 |
1.7 Tools for information retrieval | p. 28 |
1.8 Conclusion | p. 28 |
1.9 Bibliography | p. 29 |
Chapter 2 Learnable Ranking Models for Automatic Text Summarization and Information Retrieval | p. 33 |
2.1 Introduction | p. 33 |
2.1.1 Ranking of instances | p. 34 |
2.1.2 Ranking of alternatives | p. 42 |
2.1.3 Relation to existing frameworks | p. 44 |
2.2 Application to automatic text summarization | p. 45 |
2.2.1 Presentation of the application | p. 45 |
2.2.2 Automatic summary and learning | p. 48 |
2.3 Application to information retrieval | p. 49 |
2.3.1 Application presentation | p. 49 |
2.3.2 Search engines and learning | p. 50 |
2.3.3 Experimental results | p. 53 |
2.4 Conclusion | p. 54 |
2.5 Bibliography | p. 54 |
Part 2 Classification and Clustering | p. 59 |
Chapter 3 Logistic Regression and Text Classification | p. 61 |
3.1 Introduction | p. 61 |
3.2 Generalized linear model | p. 62 |
3.3 Parameter estimation | p. 65 |
3.4 Logistic regression | p. 68 |
3.4.1 Multinomial logistic regression | p. 69 |
3.5 Model selection | p. 70 |
3.5.1 Ridge regularization | p. 71 |
3.5.2 LASSO regularization | p. 71 |
3.5.3 Selected Ridge regularization | p. 72 |
3.6 Logistic regression applied to text classification | p. 74 |
3.6.1 Problem statement | p. 74 |
3.6.2 Data pre-processing | p. 75 |
3.6.3 Experimental results | p. 76 |
3.7 Conclusion | p. 81 |
3.8 Bibliography | p. 82 |
Chapter 4 Kernel Methods for Textual Information Access | p. 85 |
4.1 Kernel methods: context and intuitions | p. 85 |
4.2 General principles of kernel methods | p. 88 |
4.3 General problems with kernel choices (kernel engineering) | p. 95 |
4.4 Kernel versions of standard algorithms: examples of solvers | p. 97 |
4.4.1 Kernel logistic regression | p. 98 |
4.4.2 Support vector machines | p. 99 |
4.4.3 Principal component analysis | p. 101 |
4.4.4 Other methods | p. 102 |
4.5 Kernels for text entities | p. 103 |
4.5.1 "Bag-of-words" kernels | p. 104 |
4.5.2 Semantic kernels | p. 105 |
4.5.3 Diffusion kernels | p. 107 |
4.5.4 Sequence kernels | p. 109 |
4.5.5 Tree kernels | p. 112 |
4.5.6 Graph kernels | p. 116 |
4.5.7 Kernels derived from generative models | p. 119 |
4.6 Summary | p. 123 |
4.7 Bibliography | p. 124 |
Chapter 5 Topic-Based Generative Models for Text Information Access | p. 129 |
5.1 Introduction | p. 129 |
5.1.1 Generative versus discriminative models | p. 129 |
5.1.2 Text models | p. 131 |
5.1.3 Estimation, prediction and smoothing | p. 133 |
5.1.4 Terminology and notations | p. 134 |
5.2 Topic-based models | p. 135 |
5.2.1 Fundamental principles | p. 135 |
5.2.2 Illustration | p. 136 |
5.2.3 General framework | p. 138 |
5.2.4 Geometric interpretation | p. 139 |
5.2.5 Application to text categorization | p. 141 |
5.3 Topic models | p. 142 |
5.3.1 Probabilistic Latent Semantic Indexing | p. 143 |
5.3.2 Latent Dirichlet Allocation | p. 146 |
5.3.3 Conclusion | p. 160 |
5.4 Term models | p. 161 |
5.4.1 Limitations of the multinomial | p. 161 |
5.4.2 Dirichlet compound multinomial | p. 162 |
5.4.3 DCM-LDA | p. 163 |
5.5 Similarity measures between documents | p. 164 |
5.5.1 Language models | p. 165 |
5.5.2 Similarity between topic distributions | p. 165 |
5.5.3 Fisher kernels | p. 166 |
5.6 Conclusion | p. 168 |
5.7 Topic model software | p. 169 |
5.8 Bibliography | p. 170 |
Chapter 6 Conditional Random Fields for Information Extraction | p. 179 |
6.1 Introduction | p. 179 |
6.2 Information extraction | p. 180 |
6.2.1 The task | p. 180 |
6.2.2 Variants | p. 182 |
6.2.3 Evaluations | p. 182 |
6.2.4 Approaches not based on machine learning | p. 183 |
6.3 Machine learning for information extraction | p. 184 |
6.3.1 Usage and limitations | p. 184 |
6.3.2 Some applicable machine learning methods | p. 185 |
6.3.3 Annotating to extract | p. 186 |
6.4 Introduction to conditional random fields | p. 187 |
6.4.1 Formalization of a labelling problem | p. 187 |
6.4.2 Maximum entropy model approach | p. 188 |
6.4.3 Hidden Markov model approach | p. 190 |
6.4.4 Graphical models | p. 191 |
6.5 Conditional random fields | p. 193 |
6.5.1 Definition | p. 193 |
6.5.2 Factorization and graphical models | p. 195 |
6.5.3 Junction tree | p. 196 |
6.5.4 Inference in CRFs | p. 198 |
6.5.5 Inference algorithms | p. 200 |
6.5.6 Training CRFs | p. 201 |
6.6 Conditional random fields and their applications | p. 203 |
6.6.1 Linear conditional random fields | p. 204 |
6.6.2 Links between linear CRFs and hidden Markov models | p. 205 |
6.6.3 Interests and applications of CRFs | p. 208 |
6.6.4 Beyond linear CRFs | p. 210 |
6.6.5 Existing libraries | p. 211 |
6.7 Conclusion | p. 214 |
6.8 Bibliography | p. 215 |
Part 3 Multilingualism | p. 221 |
Chapter 7 Statistical Methods for Machine Translation | p. 223 |
7.1 Introduction | p. 223 |
7.1.1 Machine translation in the age of the Internet | p. 223 |
7.1.2 Organization of the chapter | p. 226 |
7.1.3 Terminological remarks | p. 227 |
7.2 Probabilistic machine translation: an overview | p. 227 |
7.2.1 Statistical machine translation: the standard model | p. 228 |
7.2.2 Word-based models and their limitations | p. 230 |
7.2.3 Phrase-based models | p. 234 |
7.3 Phrase-based models | p. 235 |
7.3.1 Building word alignments | p. 237 |
7.3.2 Word alignment models: a summary | p. 245 |
7.3.3 Extracting bisegments | p. 246 |
7.4 Modeling reorderings | p. 250 |
7.4.1 The space of possible reorderings | p. 250 |
7.4.2 Evaluating permutations | p. 255 |
7.5 Translation: a search problem | p. 259 |
7.5.1 Combining models | p. 259 |
7.5.2 The decoding problem | p. 261 |
7.5.3 Exact search algorithms | p. 262 |
7.5.4 Heuristic search algorithms | p. 267 |
7.5.5 Decoding: a solved problem? | p. 272 |
7.6 Evaluating machine translation | p. 272 |
7.6.1 Subjective evaluations | p. 273 |
7.6.2 The BLEU metric | p. 275 |
7.6.3 Alternatives to BLEU | p. 277 |
7.6.4 Evaluating machine translation: an open problem | p. 279 |
7.7 State-of-the-art and recent developments | p. 279 |
7.7.1 Using source context | p. 279 |
7.7.2 Hierarchical models | p. 281 |
7.7.3 Translating with linguistic resources | p. 283 |
7.8 Useful resources | p. 287 |
7.8.1 Bibliographic data and online resources | p. 288 |
7.8.2 Parallel corpora | p. 288 |
7.8.3 Tools for statistical machine translation | p. 288 |
7.9 Conclusion | p. 289 |
7.10 Acknowledgments | p. 291 |
7.11 Bibliography | p. 291 |
Part 4 Emerging Applications | p. 305 |
Chapter 8 Information Mining: Methods and Interfaces for Accessing Complex Information | p. 307 |
8.1 Introduction | p. 307 |
8.2 The multidimensional visualization of information | p. 309 |
8.2.1 Accessing information based on the knowledge of the structured domain | p. 309 |
8.2.2 Visualization of a set of documents via their content | p. 313 |
8.2.3 OLAP principles applied to document sets | p. 317 |
8.3 Domain mapping via social networks | p. 320 |
8.4 Analyzing the variability of searches and data merging | p. 323 |
8.4.1 Analysis of IR engine results | p. 323 |
8.4.2 Use of data unification | p. 325 |
8.5 The seven types of evaluation measures used in IR | p. 327 |
8.6 Conclusion | p. 331 |
8.7 Acknowledgments | p. 332 |
8.8 Bibliography | p. 332 |
Chapter 9 Opinion Detection as a Topic Classification Problem | p. 337 |
9.1 Introduction | p. 337 |
9.2 The TREC and TAC evaluation campaigns | p. 339 |
9.2.1 Opinion detection by question-answering | p. 340 |
9.2.2 Automatic summarization of opinions | p. 342 |
9.2.3 The text mining challenge of opinion classification (DEFT, Défi Fouille de Textes) | p. 343 |
9.3 Cosine weights - a second glance | p. 347 |
9.4 Which components for opinion vectors? | p. 348 |
9.4.1 How to pass from words to terms? | p. 349 |
9.5 Experiments | p. 352 |
9.5.1 Performance, analysis, and visualization of the results on the IMDB corpus | p. 354 |
9.6 Extracting opinions from speech: automatic analysis of phone polls | p. 357 |
9.6.1 France Télécom opinion investigation corpus | p. 358 |
9.6.2 Automatic recognition of spontaneous speech in opinion corpora | p. 360 |
9.6.3 Evaluation | p. 363 |
9.7 Conclusion | p. 365 |
9.8 Bibliography | p. 366 |
Appendix A Probabilistic Models: An Introduction | p. 369 |
A.1 Introduction | p. 369 |
A.2 Supervised categorization | p. 370 |
A.2.1 Filtering documents | p. 370 |
A.2.2 The Bernoulli model | p. 372 |
A.2.3 The multinomial model | p. 376 |
A.2.4 Evaluating categorization systems | p. 379 |
A.2.5 Extensions | p. 380 |
A.2.6 A first summary | p. 383 |
A.3 Unsupervised learning: the multinomial mixture model | p. 384 |
A.3.1 Mixture models | p. 384 |
A.3.2 Parameter estimation | p. 386 |
A.3.3 Applications | p. 390 |
A.4 Markov models: statistical models for sequences | p. 391 |
A.4.1 Modeling sequences | p. 391 |
A.4.2 Estimating a Markov model | p. 394 |
A.4.3 Language models | p. 395 |
A.5 Hidden Markov models | p. 397 |
A.5.1 The model | p. 398 |
A.5.2 Algorithms for hidden Markov models | p. 399 |
A.6 Conclusion | p. 410 |
A.7 A primer of probability theory | p. 411 |
A.7.1 Probability space, event | p. 411 |
A.7.2 Conditional independence and probability | p. 412 |
A.7.3 Random variables, moments | p. 413 |
A.7.4 Some useful distributions | p. 418 |
A.8 Bibliography | p. 420 |
List of Authors | p. 423 |
Index | p. 425 |