Item Barcode | Call Number | Material Type | Item Category |
---|---|---|---|
30000010218500 | QA76.9.H85 M84 2010 | Open Access Book | Book |
Summary
Multimodal signal processing is an important research and development field in which signals are processed and information is combined from a variety of modalities (speech, vision, language, text) to significantly enhance the understanding, modelling, and performance of human-computer interaction systems, as well as of systems that enhance human-human communication. The overarching theme of this book is the application of signal processing and statistical machine learning techniques to problems arising in this multidisciplinary field. It describes the capabilities and limitations of current technologies and discusses the technical challenges that must be overcome to develop efficient and user-friendly multimodal interactive systems.
With contributions from leading experts in the field, this book serves as a reference on multimodal signal processing for signal processing researchers, graduate students, R&D engineers, and computer engineers interested in this emerging field.
Author Notes
Jean-Philippe Thiran received his PhD from the Université Catholique de Louvain (UCL) in 1997. He is Assistant Professor at the Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland, where he is responsible for the image analysis group. Dr Thiran's current scientific interests include image segmentation, multimodal signal processing, and medical image analysis.
Ferran Marqués is Full Professor in the TSC Department of the Universitat Politècnica de Catalunya (UPC), where he lectures on digital signal and image processing. He has previously held posts at EPFL and the University of Southern California. He received his PhD from UPC in December 1992.
Hervé Bourlard is Director of the Idiap Research Institute, Full Professor at the Swiss Federal Institute of Technology in Lausanne (EPFL), and Director of a National Centre of Competence in Research on 'Interactive Multimodal Information Management'. His current interests mainly include statistical pattern classification, signal processing, multi-channel processing, artificial neural networks, and applied mathematics.
Table of Contents
Preface | p. xiii |
1 Introduction | p. 1 |
Part I Signal Processing, Modelling and Related Mathematical Tools | p. 5 |
2 Statistical Machine Learning for HCI | p. 7 |
2.1 Introduction | p. 7 |
2.2 Introduction to Statistical Learning | p. 8 |
2.2.1 Types of Problem | p. 8 |
2.2.2 Function Space | p. 9 |
2.2.3 Loss Functions | p. 10 |
2.2.4 Expected Risk and Empirical Risk | p. 10 |
2.2.5 Statistical Learning Theory | p. 11 |
2.3 Support Vector Machines for Binary Classification | p. 13 |
2.4 Hidden Markov Models for Speech Recognition | p. 16 |
2.4.1 Speech Recognition | p. 17 |
2.4.2 Markovian Processes | p. 17 |
2.4.3 Hidden Markov Models | p. 18 |
2.4.4 Inference and Learning with HMMs | p. 20 |
2.4.5 HMMs for Speech Recognition | p. 22 |
2.5 Conclusion | p. 22 |
References | p. 23 |
3 Speech Processing | p. 25 |
3.1 Introduction | p. 26 |
3.2 Speech Recognition | p. 28 |
3.2.1 Feature Extraction | p. 28 |
3.2.2 Acoustic Modelling | p. 30 |
3.2.3 Language Modelling | p. 33 |
3.2.4 Decoding | p. 34 |
3.2.5 Multiple Sensors | p. 35 |
3.2.6 Confidence Measures | p. 37 |
3.2.7 Robustness | p. 38 |
3.3 Speaker Recognition | p. 40 |
3.3.1 Overview | p. 40 |
3.3.2 Robustness | p. 43 |
3.4 Text-to-Speech Synthesis | p. 44 |
3.4.1 Natural Language Processing for Speech Synthesis | p. 44 |
3.4.2 Concatenative Synthesis with a Fixed Inventory | p. 46 |
3.4.3 Unit Selection-Based Synthesis | p. 50 |
3.4.4 Statistical Parametric Synthesis | p. 53 |
3.5 Conclusions | p. 56 |
References | p. 57 |
4 Natural Language and Dialogue Processing | p. 63 |
4.1 Introduction | p. 63 |
4.2 Natural Language Understanding | p. 64 |
4.2.1 Syntactic Parsing | p. 64 |
4.2.2 Semantic Parsing | p. 68 |
4.2.3 Contextual Interpretation | p. 70 |
4.3 Natural Language Generation | p. 71 |
4.3.1 Document Planning | p. 72 |
4.3.2 Microplanning | p. 73 |
4.3.3 Surface Realisation | p. 73 |
4.4 Dialogue Processing | p. 74 |
4.4.1 Discourse Modelling | p. 74 |
4.4.2 Dialogue Management | p. 77 |
4.4.3 Degrees of Initiative | p. 80 |
4.4.4 Evaluation | p. 81 |
4.5 Conclusion | p. 85 |
References | p. 85 |
5 Image and Video Processing Tools for HCI | p. 93 |
5.1 Introduction | p. 93 |
5.2 Face Analyses | p. 94 |
5.2.1 Face Detection | p. 95 |
5.2.2 Face Tracking | p. 96 |
5.2.3 Facial Feature Detection and Tracking | p. 98 |
5.2.4 Gaze Analysis | p. 100 |
5.2.5 Face Recognition | p. 101 |
5.2.6 Facial Expression Recognition | p. 103 |
5.3 Hand-Gesture Analysis | p. 104 |
5.4 Head Orientation Analysis and FoA Estimation | p. 106 |
5.4.1 Head Orientation Analysis | p. 106 |
5.4.2 Focus of Attention Estimation | p. 107 |
5.5 Body Gesture Analysis | p. 109 |
5.6 Conclusions | p. 112 |
References | p. 112 |
6 Processing of Handwriting and Sketching Dynamics | p. 119 |
6.1 Introduction | p. 119 |
6.2 History of Handwriting Modality and the Acquisition of Online Handwriting Signals | p. 121 |
6.3 Basics in Acquisition, Examples for Sensors | p. 123 |
6.4 Analysis of Online Handwriting and Sketching Signals | p. 124 |
6.5 Overview of Recognition Goals in HCI | p. 125 |
6.6 Sketch Recognition for User Interface Design | p. 128 |
6.7 Similarity Search in Digital Ink | p. 133 |
6.8 Summary and Perspectives for Handwriting and Sketching in HCI | p. 138 |
References | p. 139 |
Part II Multimodal Signal Processing and Modelling | p. 143 |
7 Basic Concepts of Multimodal Analysis | p. 143 |
7.1 Defining Multimodality | p. 145 |
7.2 Advantages of Multimodal Analysis | p. 148 |
7.3 Conclusion | p. 151 |
References | p. 152 |
8 Multimodal Information Fusion | p. 153 |
8.1 Introduction | p. 153 |
8.2 Levels of Fusion | p. 156 |
8.3 Adaptive versus Non-Adaptive Fusion | p. 158 |
8.4 Other Design Issues | p. 162 |
8.5 Conclusions | p. 165 |
References | p. 165 |
9 Modality Integration Methods | p. 171 |
9.1 Introduction | p. 171 |
9.2 Multimodal Fusion for AVSR | p. 172 |
9.2.1 Types of Fusion | p. 172 |
9.2.2 Multistream HMMs | p. 174 |
9.2.3 Stream Reliability Estimates | p. 174 |
9.3 Multimodal Speaker Localisation | p. 178 |
9.4 Conclusion | p. 181 |
References | p. 181 |
10 A Multimodal Recognition Framework for Joint Modality Compensation and Fusion | p. 185 |
10.1 Introduction | p. 186 |
10.2 Joint Modality Recognition and Applications | p. 188 |
10.3 A New Joint Modality Recognition Scheme | p. 191 |
10.3.1 Concept | p. 191 |
10.3.2 Theoretical Background | p. 191 |
10.4 Joint Modality Audio-Visual Speech Recognition | p. 194 |
10.4.1 Signature Extraction Stage | p. 196 |
10.4.2 Recognition Stage | p. 197 |
10.5 Joint Modality Recognition in Biometrics | p. 198 |
10.5.1 Overview | p. 198 |
10.5.2 Results | p. 199 |
10.6 Conclusions | p. 203 |
References | p. 204 |
11 Managing Multimodal Data, Metadata and Annotations: Challenges and Solutions | p. 207 |
11.1 Introduction | p. 208 |
11.2 Setting the Stage: Concepts and Projects | p. 208 |
11.2.1 Metadata versus Annotations | p. 209 |
11.2.2 Examples of Large Multimodal Collections | p. 210 |
11.3 Capturing and Recording Multimodal Data | p. 211 |
11.3.1 Capture Devices | p. 211 |
11.3.2 Synchronisation | p. 212 |
11.3.3 Activity Types in Multimodal Corpora | p. 213 |
11.3.4 Examples of Set-ups and Raw Data | p. 213 |
11.4 Reference Metadata and Annotations | p. 214 |
11.4.1 Gathering Metadata: Methods | p. 215 |
11.4.2 Metadata for the AMI Corpus | p. 216 |
11.4.3 Reference Annotations: Procedure and Tools | p. 217 |
11.5 Data Storage and Access | p. 219 |
11.5.1 Exchange Formats for Metadata and Annotations | p. 219 |
11.5.2 Data Servers | p. 221 |
11.5.3 Accessing Annotated Multimodal Data | p. 222 |
11.6 Conclusions and Perspectives | p. 223 |
References | p. 224 |
Part III Multimodal Human-Computer and Human-to-Human Interaction | p. 229 |
12 Multimodal Input | p. 231 |
12.1 Introduction | p. 231 |
12.2 Advantages of Multimodal Input Interfaces | p. 232 |
12.2.1 State-of-the-Art Multimodal Input Systems | p. 234 |
12.3 Multimodality, Cognition and Performance | p. 237 |
12.3.1 Multimodal Perception and Cognition | p. 237 |
12.3.2 Cognitive Load and Performance | p. 238 |
12.4 Understanding Multimodal Input Behaviour | p. 239 |
12.4.1 Theoretical Frameworks | p. 240 |
12.4.2 Interpretation of Multimodal Input Patterns | p. 243 |
12.5 Adaptive Multimodal Interfaces | p. 245 |
12.5.1 Designing Multimodal Interfaces that Manage Users' Cognitive Load | p. 246 |
12.5.2 Designing Low-Load Multimodal Interfaces for Education | p. 248 |
12.6 Conclusions and Future Directions | p. 250 |
References | p. 251 |
13 Multimodal Output: Facial Motion, Gestures and Synthesised Speech Synchronisation | p. 257 |
13.1 Introduction | p. 257 |
13.2 Basic AV Speech Synthesis | p. 258 |
13.3 The Animation System | p. 260 |
13.4 Coarticulation | p. 263 |
13.5 Extended AV Speech Synthesis | p. 264 |
13.5.1 Data-Driven Approaches | p. 267 |
13.5.2 Rule-Based Approaches | p. 269 |
13.6 Embodied Conversational Agents | p. 270 |
13.7 TTS Timing Issues | p. 272 |
13.7.1 On-the-Fly Synchronisation | p. 272 |
13.7.2 A Priori Synchronisation | p. 273 |
13.8 Conclusion | p. 274 |
References | p. 274 |
14 Interactive Representations of Multimodal Databases | p. 279 |
14.1 Introduction | p. 279 |
14.2 Multimodal Data Representation | p. 280 |
14.3 Multimodal Data Access | p. 283 |
14.3.1 Browsing as Extension of the Query Formulation Mechanism | p. 283 |
14.3.2 Browsing for the Exploration of the Content Space | p. 287 |
14.3.3 Alternative Representations | p. 292 |
14.3.4 Evaluation | p. 292 |
14.3.5 Commercial Impact | p. 293 |
14.4 Gaining Semantics from User Interaction | p. 294 |
14.4.1 Multimodal Interactive Retrieval | p. 294 |
14.4.2 Crowdsourcing | p. 295 |
14.5 Conclusion and Discussion | p. 298 |
References | p. 299 |
15 Modelling Interest in Face-to-Face Conversations from Multimodal Nonverbal Behaviour | p. 309 |
15.1 Introduction | p. 309 |
15.2 Perspectives on Interest Modelling | p. 311 |
15.3 Computing Interest from Audio Cues | p. 315 |
15.4 Computing Interest from Multimodal Cues | p. 318 |
15.5 Other Concepts Related to Interest | p. 320 |
15.6 Concluding Remarks | p. 322 |
References | p. 323 |
Index | p. 327 |