NRC Nokia Research Center
Ruoholahti, Helsinki, Finland
Monday, November 22, 2004 at 10:00-12:00
1 Background
The 1990s saw rising interest in bimodality, especially speech & pen-based input, but also other mode combinations like speech & gaze, speech & pointing, speech & rich pen input, and speech & manual gestures. The decade started with research on recognition of read speech with high-quality microphones and ended with conversational speech and varied, low-quality microphones. In the current era, the era of multimodal interfaces, it has been shown that combining input modalities and sensors reduces the error rate significantly. The “noughties” have witnessed interest in advanced multimodal interfaces that entail three or more modalities, sensors, and even biometrics. However, the basic problems with bimodal interfaces remain to be solved. Multimodal interfaces are radically different from GUIs in many respects. They involve parallel input and output, probabilistic reasoning on input, time-criticality, and multiagent architectures. Moreover, they imply new research topics such as dialogue modeling and modality fusion. With only traditional GUIs, a standard computer’s capabilities are not in sync with users’ input/output capabilities.
The potential advantages of multimodality are, according to Prof. Oviatt:
- Flexibility
- Expressive power
- Simplified linguistic constructions (in comparison to unimodal speech interfaces)
- Support for different interaction styles
- Accommodation for more users, tasks, and settings
- Improved error handling & robustness (in comparison to other probabilistic reasoning based interaction modes)
- New forms of interaction, including mobile, tangible, and adaptive interfaces
- Minimized cognitive load as task complexity or situational stress increases.
The first question tackled by Prof. Oviatt is the robustness of interaction inferences. Is there any benefit in using multimodal interfaces over corresponding unimodal ones?
There have been research advances in what is called mutual disambiguation (MD). For example, speech input by itself is often not enough because of technical (poor microphone) or contextual (noisy background) difficulties. In an empirical study, Oviatt showed that a multimodal system is better able than a unimodal system to recognize the input of both native and accented speakers. Here, the MD pull-up (improvement in recognition accuracy due to a second source of input) is .65 for accented users and .35 for native users, the overall improvement being 41.3% (presumably in comparison to a unimodal baseline system). In another study, Oviatt showed that the MD pull-up is .34 in a mobile setting, meaning a pull-up due to gesture recognition in a noisy background context (in comparison to a unimodal system).
Oviatt’s conclusion is that multimodality supports robustness in interaction inference over unimodal systems, and that the error reduction is significant (up to .65) and reliable across several users, tasks, and interfaces. Multimodal interfaces suit noisy contexts and users better than their alternatives. A future challenge is to reduce the error rate to below 1%, the robustness barrier.
The second of the main problems is the integration of data from two or more modalities into a single inference of a meaningful event in interaction. This is closely related to the first problem. One might ask what the point is in combining two or more error-prone modalities: would it not only create more uncertainty and chaos? The answer is that multimodality, because of redundancy, increases the accuracy of interaction inferences. Partial information coming from two or more channels may reinforce one legal interpretation over the others. In theory, this is understandable, as more sense data about a distal event is naturally going to lead to a better proximal inference of that event. In practice, however, the issue is more complicated. The key is in modeling how the several input channels are related. Here, the temporal relatedness of the inputs is crucial: precedence, overlap (or lack of it), latencies, turn-taking, etc.
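The idea that partial information from two channels can reinforce one legal interpretation can be illustrated with a minimal sketch. This is not Oviatt's system: the n-best lists, the scores, the `LEGAL` compatibility table, and the `fuse` function are all hypothetical, and the score product assumes the two recognizers err independently.

```python
from itertools import product

# Hypothetical n-best lists: (hypothesis, recognizer score) per channel.
# The speech recognizer's top choice, "pen", is a misrecognition of "pan".
speech_nbest = [("pen", 0.5), ("pan", 0.4), ("zoom", 0.1)]
gesture_nbest = [("arrow", 0.6), ("circle", 0.4)]

# Hypothetical table of speech/gesture pairs that form a legal command.
LEGAL = {("pan", "arrow"), ("zoom", "circle")}


def fuse(speech, gesture, legal):
    """Rank the legal joint interpretations by the product of channel scores.

    Returns a list of ((speech, gesture), normalized score), best first.
    Pairs that are not in `legal` are discarded before ranking.
    """
    joint = [
        ((s, g), ps * pg)
        for (s, ps), (g, pg) in product(speech, gesture)
        if (s, g) in legal
    ]
    total = sum(score for _, score in joint)
    if total == 0:
        return []  # no legal joint interpretation survives
    ranked = sorted(joint, key=lambda item: item[1], reverse=True)
    return [(pair, score / total) for pair, score in ranked]


ranked = fuse(speech_nbest, gesture_nbest, LEGAL)
best_pair, best_score = ranked[0]
```

In this toy case the top speech hypothesis has no legal gesture partner, so the second-ranked speech hypothesis is pulled up to first place in the joint ranking. A real integrator would also have to check the temporal relation of the two signals before pairing them.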
Some of the myths about multimodal integration among engineers involve:
Consequently, even state-of-the-art systems have fixed thresholds for integration latencies, which cannot be regarded as adaptive and which run counter to a recent body of empirical findings.
As Oviatt’s work has shown, fixed-threshold solutions are unacceptable, as 70% of people are “simultaneous” integrators (SIM) and 30% are “sequential” integrators (SEQ), the distribution being more or less bimodal, with only a few people falling in between the two. Simultaneous integrators start input in one modality before input in the other modality has ended, whereas sequential integrators leave a temporal gap before switching to the other. Oviatt’s follow-up studies of this phenomenon indicate that this may be a remarkably consistent trait that people are not willing to change through training or instructions, even in a situation where the old style is penalized. A longitudinal study revealed that under such unrewarding conditions only 10% change their style, whilst with no penalty only about 3% change.
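The SIM/SEQ distinction described above lends itself to a small sketch of what a per-user, non-fixed threshold could look like. Everything here is an assumption made for illustration: the `Construction` record, the majority-vote rule in `classify_user`, and the `margin` parameter in `wait_threshold` are hypothetical and are not taken from Oviatt's work.

```python
from dataclasses import dataclass


@dataclass
class Construction:
    """One multimodal command: start/end times (seconds) of both signals."""
    first_start: float
    first_end: float
    second_start: float
    second_end: float


def lag(c):
    """Time from the end of the first signal to the start of the second.

    A negative lag means the two signals overlap in time.
    """
    return c.second_start - c.first_end


def classify_user(constructions):
    """Label a user "SIM" if most constructions overlap, otherwise "SEQ"."""
    if not constructions:
        raise ValueError("need at least one observed construction")
    overlapping = sum(1 for c in constructions if lag(c) < 0)
    return "SIM" if overlapping * 2 > len(constructions) else "SEQ"


def wait_threshold(constructions, margin=0.5):
    """Hypothetical per-user wait (seconds) after the first signal ends
    before the system treats the input as unimodal.

    SIM users normally start the second signal before the first ends, so
    only a short safety margin is kept. SEQ users get their longest
    observed lag plus the margin.
    """
    if classify_user(constructions) == "SIM":
        return margin
    return max(lag(c) for c in constructions) + margin


# Hypothetical observations for two users.
sim_user = [
    Construction(0.0, 1.2, 0.8, 2.0),
    Construction(0.0, 1.0, 0.5, 1.5),
    Construction(0.0, 1.0, 1.2, 2.0),
]
seq_user = [
    Construction(0.0, 1.0, 1.5, 2.5),
    Construction(0.0, 1.0, 2.0, 3.0),
]
```

Given how stable the integration pattern is reported to be, a label learned from a handful of early constructions could plausibly be reused for the rest of a session, which is what makes a per-user threshold attractive compared with a single fixed one.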
Moreover, Oviatt found that people accentuate these patterns. SEQ users make the latencies between inputs longer, while SIM users make their inputs overlap even more. This is what Oviatt calls hypertiming, and she points out that this kind of adaptation is not fruitful from the point of view of recognition accuracy. Furthermore, Oviatt found “entrenching” in integration patterns during increased task difficulty, which she believes is due to an attempt to restore equilibrium (in the spirit of Gestalt theories) to an unbalanced situation. My own interpretation is that hypertiming is a response to the poor recognition accuracy of the system in the first place. Exaggeration and hyperarticulation are very common in communicative situations where we have to get a point across to a child, a non-native speaker, etc.
Oviatt listed topics for future work such as:
- user modeling and automatic learning of existing integration patterns
- breaking the robustness barrier (lowering errors to < 1%)
- M3 interfaces (multisensor-multimodal-multibiometric)
- Novel mobile adaptive interfaces
However, the discussion after the lecture pointed out that several challenges are apparent:
- If input is more limited than output, could we do more with output? For example, could we give feedback continuously?
- How do people adapt to adaptivity? Adaptation is rarely uni-directional, but more likely reciprocal by nature.
- How could we use multimodality for human-human communication? For example, transitions between communication channels?
- Does multimodality suit closed-loop tasks where the user is in continuous interaction with the application (e.g., gaming)?
- A more general question is how Oviatt’s results on MD and integration patterns generalize to other modalities and tasks.
- What does hybrid integration entail? Probabilistic reasoning and semantic high-level inferences might not be the way, due to problems faced already in strong AI etc.
- Could multimodality be used in smart spaces?
- Do we need different multimodality models for different clusters of people, e.g. seniors?
- How is multimodal output integration related to multimodal input integration? There is a lot of separate research on perception and output, but here we need to attack the two together.
- Low responsiveness of multimodal systems may lead to frustration; we need to study how to adapt response thresholds for different user types.
I personally had a feeling that the questions Oviatt had addressed are important, but many other key questions still remain open:
- The concept of multimodality in Oviatt’s experiments is a bit slanted. I believe that it is easy to show that adding more input channels improves recognition accuracy. This cannot be a surprise to anyone who knows information theory. However, as Oviatt noted, about 70% of input in mobile settings is actually unimodal, not multimodal! This means that the question should not be so much about how to integrate the two modalities as about how to offer users meaningful choices for switching fluently between the modalities. For example, in a mobile situation, how could you use a map interface auditorily while riding a bike, or read emails while sipping coffee in a café? Thus, I see that the crucial topics of modality-switching, modality-control, and the modality-bound nature of representations are missing from the present work.
- Second, I think that Oviatt’s concept of mobility was quite restricted, specifically as she operationalized it as a noisy background, not as different aspects of “doing” things while being mobile (or not being able to do things because of being mobile). I think that partly because of this shortcoming, the UIs Oviatt went through were not particularly convincing to me, as they entailed a lot of heavy and resource-reserving interaction to perform simple operations like zooming a map. A broader view of mobility would help to see what kinds of situations/users/contexts/services/… would most likely benefit from multimodality.
- Third, and related to the second, although Oviatt had done an impressive job in comparing multimodal interfaces to unimodal ones, at no point did she compare multimodality to high-quality GUI solutions. Thus, I believe, the baseline for comparison was not well justified. I could argue that instead of using both pen and speech (both very prone to errors), one could, with good GUI design, achieve a much better user experience, especially in the map applications that Oviatt used as examples of her work. If multimodal interfaces are to replace GUIs, setting the baseline to unimodal interfaces is misleading.
Link to Prof. Oviatt’s publications: http://www.cse.ogi.edu/CHCC/Personnel/oviatt.html