Item Barcode | Call Number | Material Type | Item Category |
---|---|---|---|
30000010218500 | QA76.9.H85 M84 2010 | Open Access Book | Book |
Summary
Multimodal signal processing is an important research and development field in which signals are processed and information is combined from a variety of modalities (speech, vision, language, text) to significantly enhance the understanding, modelling, and performance of human-computer interaction systems, as well as of systems that enhance human-human communication. The overarching theme of this book is the application of signal processing and statistical machine learning techniques to problems arising in this multidisciplinary field. It describes the capabilities and limitations of current technologies and discusses the technical challenges that must be overcome to develop efficient and user-friendly multimodal interactive systems.
With contributions from leading experts in the field, this book serves as a reference on multimodal signal processing for signal processing researchers, graduate students, R&D engineers, and computer engineers interested in this emerging field.
Author Notes
Jean-Philippe Thiran received his PhD from the Université Catholique de Louvain (UCL) in 1997. He is Assistant Professor at the Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland, where he is responsible for the image analysis group. Dr Thiran's current scientific interests include image segmentation, multimodal signal processing, and medical image analysis.
Ferran Marqués is Full Professor in the TSC Department of the Universitat Politècnica de Catalunya (UPC), where he lectures on digital signal and image processing. He has previously held posts at EPFL and the University of Southern California. He received his PhD from UPC in December 1992.
Hervé Bourlard is Director of the Idiap Research Institute, Full Professor at the Swiss Federal Institute of Technology in Lausanne (EPFL), and Director of a National Centre of Competence in Research on 'Interactive Multimodal Information Management'. His current interests mainly include statistical pattern classification, signal processing, multi-channel processing, artificial neural networks, and applied mathematics.
Table of Contents
Preface | p. xiii |
1 Introduction | p. 1 |
Part I Signal Processing, Modelling and Related Mathematical Tools | p. 5 |
2 Statistical Machine Learning for HCI | p. 7 |
2.1 Introduction | p. 7 |
2.2 Introduction to Statistical Learning | p. 8 |
2.2.1 Types of Problem | p. 8 |
2.2.2 Function Space | p. 9 |
2.2.3 Loss Functions | p. 10 |
2.2.4 Expected Risk and Empirical Risk | p. 10 |
2.2.5 Statistical Learning Theory | p. 11 |
2.3 Support Vector Machines for Binary Classification | p. 13 |
2.4 Hidden Markov Models for Speech Recognition | p. 16 |
2.4.1 Speech Recognition | p. 17 |
2.4.2 Markovian Processes | p. 17 |
2.4.3 Hidden Markov Models | p. 18 |
2.4.4 Inference and Learning with HMMs | p. 20 |
2.4.5 HMMs for Speech Recognition | p. 22 |
2.5 Conclusion | p. 22 |
References | p. 23 |
3 Speech Processing | p. 25 |
3.1 Introduction | p. 26 |
3.2 Speech Recognition | p. 28 |
3.2.1 Feature Extraction | p. 28 |
3.2.2 Acoustic Modelling | p. 30 |
3.2.3 Language Modelling | p. 33 |
3.2.4 Decoding | p. 34 |
3.2.5 Multiple Sensors | p. 35 |
3.2.6 Confidence Measures | p. 37 |
3.2.7 Robustness | p. 38 |
3.3 Speaker Recognition | p. 40 |
3.3.1 Overview | p. 40 |
3.3.2 Robustness | p. 43 |
3.4 Text-to-Speech Synthesis | p. 44 |
3.4.1 Natural Language Processing for Speech Synthesis | p. 44 |
3.4.2 Concatenative Synthesis with a Fixed Inventory | p. 46 |
3.4.3 Unit Selection-Based Synthesis | p. 50 |
3.4.4 Statistical Parametric Synthesis | p. 53 |
3.5 Conclusions | p. 56 |
References | p. 57 |
4 Natural Language and Dialogue Processing | p. 63 |
4.1 Introduction | p. 63 |
4.2 Natural Language Understanding | p. 64 |
4.2.1 Syntactic Parsing | p. 64 |
4.2.2 Semantic Parsing | p. 68 |
4.2.3 Contextual Interpretation | p. 70 |
4.3 Natural Language Generation | p. 71 |
4.3.1 Document Planning | p. 72 |
4.3.2 Microplanning | p. 73 |
4.3.3 Surface Realisation | p. 73 |
4.4 Dialogue Processing | p. 74 |
4.4.1 Discourse Modelling | p. 74 |
4.4.2 Dialogue Management | p. 77 |
4.4.3 Degrees of Initiative | p. 80 |
4.4.4 Evaluation | p. 81 |
4.5 Conclusion | p. 85 |
References | p. 85 |
5 Image and Video Processing Tools for HCI | p. 93 |
5.1 Introduction | p. 93 |
5.2 Face Analyses | p. 94 |
5.2.1 Face Detection | p. 95 |
5.2.2 Face Tracking | p. 96 |
5.2.3 Facial Feature Detection and Tracking | p. 98 |
5.2.4 Gaze Analysis | p. 100 |
5.2.5 Face Recognition | p. 101 |
5.2.6 Facial Expression Recognition | p. 103 |
5.3 Hand-Gesture Analysis | p. 104 |
5.4 Head Orientation Analysis and FoA Estimation | p. 106 |
5.4.1 Head Orientation Analysis | p. 106 |
5.4.2 Focus of Attention Estimation | p. 107 |
5.5 Body Gesture Analysis | p. 109 |
5.6 Conclusions | p. 112 |
References | p. 112 |
6 Processing of Handwriting and Sketching Dynamics | p. 119 |
6.1 Introduction | p. 119 |
6.2 History of Handwriting Modality and the Acquisition of Online Handwriting Signals | p. 121 |
6.3 Basics in Acquisition, Examples for Sensors | p. 123 |
6.4 Analysis of Online Handwriting and Sketching Signals | p. 124 |
6.5 Overview of Recognition Goals in HCI | p. 125 |
6.6 Sketch Recognition for User Interface Design | p. 128 |
6.7 Similarity Search in Digital Ink | p. 133 |
6.8 Summary and Perspectives for Handwriting and Sketching in HCI | p. 138 |
References | p. 139 |
Part II Multimodal Signal Processing and Modelling | p. 143 |
7 Basic Concepts of Multimodal Analysis | p. 143 |
7.1 Defining Multimodality | p. 145 |
7.2 Advantages of Multimodal Analysis | p. 148 |
7.3 Conclusion | p. 151 |
References | p. 152 |
8 Multimodal Information Fusion | p. 153 |
8.1 Introduction | p. 153 |
8.2 Levels of Fusion | p. 156 |
8.3 Adaptive versus Non-Adaptive Fusion | p. 158 |
8.4 Other Design Issues | p. 162 |
8.5 Conclusions | p. 165 |
References | p. 165 |
9 Modality Integration Methods | p. 171 |
9.1 Introduction | p. 171 |
9.2 Multimodal Fusion for AVSR | p. 172 |
9.2.1 Types of Fusion | p. 172 |
9.2.2 Multistream HMMs | p. 174 |
9.2.3 Stream Reliability Estimates | p. 174 |
9.3 Multimodal Speaker Localisation | p. 178 |
9.4 Conclusion | p. 181 |
References | p. 181 |
10 A Multimodal Recognition Framework for Joint Modality Compensation and Fusion | p. 185 |
10.1 Introduction | p. 186 |
10.2 Joint Modality Recognition and Applications | p. 188 |
10.3 A New Joint Modality Recognition Scheme | p. 191 |
10.3.1 Concept | p. 191 |
10.3.2 Theoretical Background | p. 191 |
10.4 Joint Modality Audio-Visual Speech Recognition | p. 194 |
10.4.1 Signature Extraction Stage | p. 196 |
10.4.2 Recognition Stage | p. 197 |
10.5 Joint Modality Recognition in Biometrics | p. 198 |
10.5.1 Overview | p. 198 |
10.5.2 Results | p. 199 |
10.6 Conclusions | p. 203 |
References | p. 204 |
11 Managing Multimodal Data, Metadata and Annotations: Challenges and Solutions | p. 207 |
11.1 Introduction | p. 208 |
11.2 Setting the Stage: Concepts and Projects | p. 208 |
11.2.1 Metadata versus Annotations | p. 209 |
11.2.2 Examples of Large Multimodal Collections | p. 210 |
11.3 Capturing and Recording Multimodal Data | p. 211 |
11.3.1 Capture Devices | p. 211 |
11.3.2 Synchronisation | p. 212 |
11.3.3 Activity Types in Multimodal Corpora | p. 213 |
11.3.4 Examples of Set-ups and Raw Data | p. 213 |
11.4 Reference Metadata and Annotations | p. 214 |
11.4.1 Gathering Metadata: Methods | p. 215 |
11.4.2 Metadata for the AMI Corpus | p. 216 |
11.4.3 Reference Annotations: Procedure and Tools | p. 217 |
11.5 Data Storage and Access | p. 219 |
11.5.1 Exchange Formats for Metadata and Annotations | p. 219 |
11.5.2 Data Servers | p. 221 |
11.5.3 Accessing Annotated Multimodal Data | p. 222 |
11.6 Conclusions and Perspectives | p. 223 |
References | p. 224 |
Part III Multimodal Human-Computer and Human-to-Human Interaction | p. 229 |
12 Multimodal Input | p. 231 |
12.1 Introduction | p. 231 |
12.2 Advantages of Multimodal Input Interfaces | p. 232 |
12.2.1 State-of-the-Art Multimodal Input Systems | p. 234 |
12.3 Multimodality, Cognition and Performance | p. 237 |
12.3.1 Multimodal Perception and Cognition | p. 237 |
12.3.2 Cognitive Load and Performance | p. 238 |
12.4 Understanding Multimodal Input Behaviour | p. 239 |
12.4.1 Theoretical Frameworks | p. 240 |
12.4.2 Interpretation of Multimodal Input Patterns | p. 243 |
12.5 Adaptive Multimodal Interfaces | p. 245 |
12.5.1 Designing Multimodal Interfaces that Manage Users' Cognitive Load | p. 246 |
12.5.2 Designing Low-Load Multimodal Interfaces for Education | p. 248 |
12.6 Conclusions and Future Directions | p. 250 |
References | p. 251 |
13 Multimodal Output: Facial Motion, Gestures and Synthesised Speech Synchronisation | p. 257 |
13.1 Introduction | p. 257 |
13.2 Basic AV Speech Synthesis | p. 258 |
13.3 The Animation System | p. 260 |
13.4 Coarticulation | p. 263 |
13.5 Extended AV Speech Synthesis | p. 264 |
13.5.1 Data-Driven Approaches | p. 267 |
13.5.2 Rule-Based Approaches | p. 269 |
13.6 Embodied Conversational Agents | p. 270 |
13.7 TTS Timing Issues | p. 272 |
13.7.1 On-the-Fly Synchronisation | p. 272 |
13.7.2 A Priori Synchronisation | p. 273 |
13.8 Conclusion | p. 274 |
References | p. 274 |
14 Interactive Representations of Multimodal Databases | p. 279 |
14.1 Introduction | p. 279 |
14.2 Multimodal Data Representation | p. 280 |
14.3 Multimodal Data Access | p. 283 |
14.3.1 Browsing as Extension of the Query Formulation Mechanism | p. 283 |
14.3.2 Browsing for the Exploration of the Content Space | p. 287 |
14.3.3 Alternative Representations | p. 292 |
14.3.4 Evaluation | p. 292 |
14.3.5 Commercial Impact | p. 293 |
14.4 Gaining Semantics from User Interaction | p. 294 |
14.4.1 Multimodal Interactive Retrieval | p. 294 |
14.4.2 Crowdsourcing | p. 295 |
14.5 Conclusion and Discussion | p. 298 |
References | p. 299 |
15 Modelling Interest in Face-to-Face Conversations from Multimodal Nonverbal Behaviour | p. 309 |
15.1 Introduction | p. 309 |
15.2 Perspectives on Interest Modelling | p. 311 |
15.3 Computing Interest from Audio Cues | p. 315 |
15.4 Computing Interest from Multimodal Cues | p. 318 |
15.5 Other Concepts Related to Interest | p. 320 |
15.6 Concluding Remarks | p. 322 |
References | p. 323 |
Index | p. 327 |