
Towards a Technology of Nonverbal Communication: Vocal Behavior in Social and Affective Phenomena

Alessandro Vinciarelli
University of Glasgow - Department of Computing Science, Sir Alwyn Williams Building, Glasgow G12 8QQ, UK
Idiap Research Institute - CP592, 1920 Martigny, Switzerland

Gelareh Mohammadi
Idiap Research Institute - CP592, 1920 Martigny, Switzerland
Ecole Polytechnique Fédérale de Lausanne - EPFL, 1015 Lausanne, Switzerland

ABSTRACT

Nonverbal communication is the main channel through which we experience the inner life of others, including their emotions, feelings, moods, social attitudes, etc. It attracts the interest of the computing community because it is based on cues like facial expressions, vocalizations, gestures, postures, etc. that we perceive with our senses and that can be (and often are) detected, analyzed and synthesized with automatic approaches. In other words, nonverbal communication can serve as a viable interface between computers and some of the most important aspects of human psychology, such as emotions and social attitudes. As a result, a new computing domain seems to be emerging that we can call the "technology of nonverbal communication". This chapter outlines some of the most salient aspects of this potentially new domain and discusses some of its most important perspectives for the future.

INTRODUCTION

Nonverbal communication is one of the most pervasive phenomena of our everyday life. On the one hand, simply because we have a body and we are alive, we constantly display a large number of nonverbal behavioral cues like facial expressions, vocalizations, postures, gestures, appearance, etc. (Knapp & Hall, 1972; Richmond & McCroskey, 1995). On the other hand, simply because we sense and perceive the cues others display, we cannot avoid interpreting and understanding them (often outside conscious awareness) in terms of feelings, emotions, attitudes, intentions, etc. (Kunda, 1999; Poggi, 2007). Thus, "We cannot not communicate" (Watzlawick et al., 1967), even when we sleep and still display feelings of which we are unaware through movements, facial expressions, etc., or when we make it clear that we do not want to communicate:

If two humans come together it is virtually inevitable that they will communicate something to each other [...] even if they do not speak, messages will pass between them. By their looks, expressions and body movement each will tell the other something, even if it is only, "I don't wish to know you: keep your distance"; "I assure you the feeling is mutual. I'll keep clear if you do".

(Argyle, 1979).


As nonverbal communication is such a salient and ubiquitous aspect of our life, it is not surprising that computing technology, expected to blend into our daily life more seamlessly and naturally than any other technology, identifies automatic understanding and synthesis of nonverbal communication as a key step towards human-centered computers, i.e. computers attuned to our natural modes of operating and communicating (Pantic et al., 2007; Pantic et al., 2008). This is the case of Affective Computing, where the aim is automatic understanding and synthesis of emotional states (Picard, 2000), of certain trends in Human-Computer Interaction, where the goal is to interface machines with the psychology of users (Reeves & Nass, 1996; Nass & Brave, 2005), of research in Embodied Conversational Agents, where the goal is to simulate credible human behavior with synthetic characters or robots (Bickmore & Cassell, 2005), and of the emerging field of Social Signal Processing, where the target is to understand the mutual relational attitudes (social signals) of people involved in social interactions (Vinciarelli et al., 2008; Vinciarelli et al., 2009).

This list of domains is by no means complete, but it is sufficient to show how a technology of nonverbal communication is actually developing in the computing community. Its main strength is an intense cross-fertilization between machine intelligence (e.g., speech processing, computer vision and machine learning) and the human sciences (e.g., psychology, anthropology and sociology), and its main targets are artificial forms of social, emotional and affective intelligence (Albrecht, 2005; Goleman, 2005). Furthermore, social and psychological research increasingly relies on technologies related to nonverbal communication to develop insights into human-human interactions, as in the case of large scale social networks (Lazer et al., 2009), organizational behavior (Olguin et al., 2009), and communication in mobile spaces (Raento et al., 2009).

This chapter aims at highlighting the most important aspects of this research trend and includes two main parts. The first introduces the main aspects of nonverbal communication technology and the second shows how the latter is applied to the analysis of social and affective phenomena. The first part introduces a general model of human-human communication, proposes a taxonomy of the nonverbal behavioral cues that can be used as perceivable stimuli in communication, and outlines the general process that nonverbal communication technology implements. The second part illustrates the most important phenomena taking place during social interactions, provides a survey of works showing how technology deals with them, and proposes the recognition of emotions in speech as a methodological example of the inference of social and affective phenomena from vocal (nonverbal) behavior. The chapter ends with a description of the emerging domain of Social Signal Processing (the most recent research avenue centered on nonverbal communication) and a list of application domains likely to benefit from the technologies described in this chapter.

PSYCHOLOGY AND TECHNOLOGY OF NONVERBAL COMMUNICATION

Nonverbal communication is a particular case of human-human communication where the means used to exchange information consists of nonverbal behavioral cues (Knapp & Hall, 1972; Richmond & McCroskey, 1995). This is appealing from a technological point of view because nonverbal cues must necessarily be accessible to our senses (in particular sight and hearing) and this makes them detectable through microphones, cameras or other suitable sensors, a conditio sine qua non for computing technology. Furthermore, many nonverbal behavioral cues are displayed outside conscious awareness and this makes them honest, i.e. sincere and reliable indices of different facets of affect (Pentland, 2008). In other words, nonverbal behavioral cues are the physical, machine detectable evidence of affective phenomena not otherwise accessible to experience, an ideal point for technology and human sciences to meet.

The rest of this section outlines the most important aspects of nonverbal communication from both psychological and technological points of view.


Psychology of Nonverbal Communication

In very general terms (Poggi, 2007), communication takes place whenever an Emitter E produces a signal in the form of a Perceivable Stimulus PS and this reaches a Receiver R who interprets the signal and extracts Information I from it, not necessarily the information that E actually wanted to convey. The emitter, and the same applies to the receiver, is not necessarily an individual person: it can be a group of individuals, a machine, an animal or any other entity capable of generating perceivable stimuli. These include whatever can be perceived by a receiver, like sounds, signs, words, chemical traces, handwritten messages, images, etc.

Signals can be classed as either communicative or informative on one hand, and as either direct or indirect on the other. A signal is said to be communicative when it is produced by an emitter with the intention of conveying a specific meaning, e.g. the "thumb up" to mean "OK", while it is informative when it is emitted unconsciously or without the intention of conveying a specific meaning, e.g. crossing one's arms during a conversation. In parallel, a signal is said to be direct when its meaning is context independent, e.g. the "thumb up" that means "OK" in any interaction context, and indirect in the opposite case, e.g. crossing arms when used by workers on strike to mean that they refuse to work.

In this framework, communication is said to be nonverbal whenever the perceivable stimuli used as signals are nonverbal behavioral cues, i.e. the myriad of observable behaviors that accompany any human-human (and human-machine) interaction and do not involve language and words: facial expressions, blinks, laughter, speech pauses, gestures (conscious and unconscious), postures, body movements, head nods, etc. (Knapp & Hall, 1972; Richmond & McCroskey, 1995). In general, nonverbal communication is particularly interesting when it involves informative behavioral cues. The reason is that these are typically produced outside conscious awareness and can be considered honest signals (Pentland, 2008), i.e. signals that leak reliable information about the actual inner state and feelings of people, whether these correspond to emotional states like anger, fear and surprise, general conditions like arousal, calm and tiredness, or attitudes towards others like empathy, interest, dominance and disappointment (Ambady et al., 2000; Ambady & Rosenthal, 1992).

Social psychology proposes to group all nonverbal behavioral cues into five classes called codes (Hecht et al., 1999): physical appearance, gestures and postures, face and eye behavior, vocal behavior, and space and environment. Table 1 reports some of the most common nonverbal behavioral cues of each code and shows the social and affective phenomena most closely related to them. By "related" it is meant that the cue accounts for the phenomenon taking place and/or influences the perception of that phenomenon. The cues listed in this section are the most important, but the list is by no means exhaustive; the interested reader can refer to specialized monographs (Knapp & Hall, 1972; Richmond & McCroskey, 1995) for an extensive survey. In the following, the codes and some of their most important cues are described in more detail.

Physical appearance: Aspect, and in particular attractiveness, is a signal that cannot be hidden and has a major impact on the perception of others. After the first pioneering investigations (Dion et al., 1972), a large body of empirical evidence has supported the "Halo effect", also known as "What is beautiful is good", i.e. the tendency to attribute socially desirable characteristics to physically attractive people. This has been measured through the higher success rate of politicians judged attractive by their electors (Surawski & Osso, 2006), through the higher percentage of individuals significantly taller than the average among CEOs of large companies (Gladwell, 2005), or through the higher likelihood of starting new relationships that attractive people enjoy (Richmond & McCroskey, 1995). Furthermore, there is a clear influence of people's somatotype (the overall body shape) on the attribution of personality traits, e.g., thin people tend to be considered lower in emotional stability, while round people tend to be considered higher in openness (Cortes & Gatti, 1965).

Gestures and Postures: Gestures are often performed consciously to convey a specific meaning (e.g., the thumb up gesture that means "OK") or to perform a specific action (e.g., pointing at something with the index finger), but in many cases they are the result of some affective process and are displayed outside conscious awareness (Poggi, 2007). This is the case of adaptors (self-touching, manipulation of small objects, rhythmic movements of legs, etc.), which typically account for boredom, discomfort and other negative feelings, and of self-protection gestures like folding arms and crossing legs (Knapp & Hall, 1972; Richmond & McCroskey, 1995). Furthermore, recent studies have shown that gestures express emotions (Coulson, 2004; Stock et al., 2007) and accompany social affective states like shame and embarrassment (Costa et al., 2001; Ekman & Rosenberg, 2005).

Postures are considered among the most reliable and honest nonverbal cues, as they are typically assumed unconsciously (Richmond & McCroskey, 1995). Following the seminal work in (Scheflen, 1964), postures convey three main kinds of social messages: inclusion and exclusion (we exclude others by orienting our body away from them), engagement (we are more involved in an interaction when we are in front of others), and rapport (we tend to imitate others' posture when they dominate us or when we like them).

Table 1. The most common nonverbal behavioral cues for each code and the social and affective behaviors (emotion, personality, status, dominance, persuasion, regulation, rapport) and detection technologies (speech analysis, computer vision, biometry) most commonly related to them. The cues, grouped by code, are:

- Physical appearance: height, attractiveness, body shape
- Gesture and posture: hand gestures, posture, walking
- Face and eye behavior: facial expressions, gaze behavior, focus of attention
- Vocal behavior: prosody, turn taking, vocal outbursts, silence
- Space and environment: distance, seating arrangement

The table has been published in Vinciarelli et al. (2009) and is courtesy of A. Vinciarelli, M. Pantic and H. Bourlard.


Face and gaze behavior: Not all nonverbal behavioral cues have the same impact on our perception of others' affect and, depending on the context, different cues carry different weight (Richmond & McCroskey, 1995). However, facial expressions and, more generally, face behaviors are typically the cues that influence our perception the most (Grahe & Bernieri, 1999). Nonverbal facial cues account for cognitive states like interest (Cunningham et al., 2004), emotions (Cohn, 2006), psychological states like suicidal depression (Ekman & Rosenberg, 2005) or pain (Williams, 2003), social behaviors like accord and rapport (Ambady & Rosenthal, 1992; Cunningham et al., 2004), personality traits like extraversion and temperament (Ekman & Rosenberg, 2005), and social signals like status and trustworthiness (Ambady & Rosenthal, 1992). Gaze behavior (who looks at whom and how much) plays a major role in exchanging the floor during conversations and in displaying dominance, power and status.

Vocal Behavior: Vocal behavior accounts for all the phenomena in speech that do not involve language or verbal content. Vocal nonverbal behavior includes five major components: prosody, linguistic vocalizations, non-linguistic vocalizations, silences, and turn-taking patterns (Richmond & McCroskey, 1995). Prosody accounts for how something is said and influences the perception of several personality traits, like competence and persuasiveness (Scherer, 1979). Linguistic vocalizations are sounds like "ehm", "ah-ah", etc. that are used as if they were words even though they are not; they typically communicate hesitation (Glass et al., 1982) or support towards the person speaking. Non-linguistic vocalizations include crying, laughter, shouts, yawns, sobbing, etc. and are typically related to strong emotional states (we cry when we are very happy or particularly sad) or tight social bonds (we laugh to show the pleasure of being with someone). Silences and pauses typically express hesitation, cognitive effort (we think about what we are going to say), or the choice of not talking even when asked to do so. Last, but not least, turn-taking, the mechanism through which people exchange the floor in conversations, has been shown to account for roles, preference structures, dominance, status, etc.
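Two of these components, turn-taking and silence, can be quantified directly from who speaks when. The following minimal sketch shows one way to do so in Python; the segment format (speaker, start, end, in seconds) and the function name are assumptions made for illustration, not an established tool.

```python
# Minimal sketch: simple turn-taking and silence statistics computed from
# diarization-style segments. The (speaker, start, end) format is an assumed
# convention for this illustration.

def turn_taking_stats(segments, total_duration):
    """Return per-speaker turn counts and speaking time, plus overall silence."""
    stats = {}
    spoken = 0.0
    for speaker, start, end in segments:
        entry = stats.setdefault(speaker, {"turns": 0, "speaking_time": 0.0})
        entry["turns"] += 1
        entry["speaking_time"] += end - start
        spoken += end - start
    silence = max(total_duration - spoken, 0.0)   # rough: ignores overlapping speech
    return stats, silence

# Example: turn_taking_stats([("A", 0.0, 3.2), ("B", 3.5, 7.0)], 10.0)
```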

Space and Environment: Social and physical space are tightly intertwined and, typically, the distance between two individuals corresponds to the kind of relationship they have, e.g. intimate (less than 0.5 meters in western cultures), casual-personal (between 0.5 and 1.2 meters) or socio-formal (between 1 and 2 meters) following the terminology in (Hall, 1959). Furthermore, the kind of relationship between people sitting around a table influences the seating positions, e.g. people collaborating tend to sit close to one another, while people discussing tend to sit in front of one another (Lott & Sommer, 1967).

Technology of Nonverbal Communication

Is it possible to derive technological value from social psychology findings about nonverbal communication? This is a core question for domains like Affective Computing (Picard, 2000) and Social Signal Processing (Vinciarelli et al., 2009; Vinciarelli, 2009), where nonverbal behavioral cues are used as physical, machine detectable evidence of emotions and social relational attitudes, respectively. Both domains start from the simple consideration that we sense nonverbal behavioral cues (most of the time unconsciously) through our eyes and ears; thus, it must be possible to sense the same nonverbal cues with cameras, microphones and any other suitable sensor. Furthermore, both domains consider that there is an inference process (in general unconscious) between the behavior we observe and the perceptions we develop in terms of emotional and social phenomena; thus, automatic inference approaches, mostly based on machine learning, could be used to understand emotional and social phenomena automatically.

Figure 1 shows the main technological components involved in approaches for automatic understanding of nonverbal communication. The scheme does not correspond to any approach in particular, but any work in the literature matches, at least partially, the process depicted in the picture. Furthermore, the scheme illustrated in Figure 1 is not supposed to describe how humans work, but only how machines can understand social and affective phenomena. Overall, the process includes four major steps described in more detail in the rest of this section.
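As a concrete (if highly simplified) illustration, the sketch below expresses this four-step process as a single Python function. The stage functions it receives (capture_signal, detect_persons, detect_cues, infer_state) are hypothetical placeholders standing for the concrete technologies discussed below, not an existing API.

```python
# Sketch of the overall process: capture -> person detection ->
# behavioral cue detection -> nonverbal behavior understanding.

def analyze_interaction(capture_signal, detect_persons, detect_cues, infer_state):
    """Run the four steps and return one inference per detected person."""
    signal = capture_signal()                   # raw audio/video from the sensors
    segments = detect_persons(signal)           # {person_id: portion of the signal}
    results = {}
    for person_id, segment in segments.items():
        cues = detect_cues(segment)             # e.g. prosody, facial expressions
        results[person_id] = infer_state(cues)  # e.g. emotion or social attitude
    return results
```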


Capture: Human behavior can be sensed with a large variety of devices, including cheap webcams installed on a laptop, fully equipped smart meeting rooms where several tens of microphones and cameras record everything that happens (McCowan et al., 2003; Waibel et al., 2003), mobile devices equipped with haptic and proximity sensors (Raento et al., 2009; Murray-Smith, 2009), pressure sensors that detect posture and movements (Kapoor et al., 2004), fisheye cameras capturing spontaneous interactions, etc. Capture is a fundamental step because, depending on the sensors, certain kinds of analysis will be possible and others will not. However, what is common to all possible sensing devices is that their output consists of signals, and these must be analyzed automatically to complete the whole process.
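As an illustration of the simplest capture setting, a single laptop microphone, the sketch below records a short audio signal. It assumes the third-party sounddevice library and a 16 kHz sampling rate; both are arbitrary choices for the example rather than requirements of the approaches surveyed here.

```python
# Sketch of the capture step for one cheap sensor (a laptop microphone).
import sounddevice as sd

SAMPLE_RATE = 16000   # Hz, a common rate for speech analysis
DURATION = 10         # seconds of audio to record

def capture_audio(duration=DURATION, sample_rate=SAMPLE_RATE):
    """Record a mono audio signal and return it as a 1-D NumPy array."""
    recording = sd.rec(int(duration * sample_rate),
                       samplerate=sample_rate, channels=1)
    sd.wait()                 # block until the recording is finished
    return recording[:, 0]    # drop the channel dimension
```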

Figure 1. This picture draws a parallel between the communication process as it takes place between humans and as it is typically implemented in a machine. The correspondence does not mean that the process implemented in the machine actually explains or describes the human-human communication process; it simply helps to understand how technology deals with nonverbal communication.

Person detection: In general, the signals obtained through capture devices portray more than one person. This is the case, for example, of audio recordings where more than one person talks, of video recordings where different persons interact with one another, etc. This requires a person detection step aimed at identifying which parts of the data correspond to which person. The reason is that nonverbal behavioral cues can be extracted reliably only when it is clear which individual corresponds to the signal under analysis. Person detection includes technologies like speaker diarization, detecting who talks when in audio data (Tranter & Reynolds, 2006), face detection, identifying what part of an image corresponds to the face of a person (Yang et al., 2002), tracking, following one or more persons moving in a video (Forsyth et al., 2006), etc. The choice of one person detection technology rather than another depends on the capture device, but the result is always the same: the signals to be analyzed are segmented into parts corresponding to single individuals.
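A minimal example of one such technology, face detection, can be sketched with OpenCV's bundled Haar cascade; the image file name below is hypothetical, and speaker diarization or tracking would play the equivalent role for audio streams and video sequences.

```python
# Sketch of face detection with OpenCV's frontal-face Haar cascade.
import cv2

def detect_faces(image_path):
    """Return bounding boxes (x, y, w, h) of the faces found in an image."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Each box delimits the image region to analyze for one person.
faces = detect_faces("meeting_frame.jpg")   # hypothetical input frame
```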

Behavioral cue detection: The technological components described so far can be considered as a preprocessing phase that gives the raw data a form suitable for the actual analysis and understanding of nonverbal communication. Behavioral cues are the perceivable stimuli that, in the communication process, are used by the emitter to convey information and by the receiver to draw information, possibly the same information the emitter wants to communicate. Detection of nonverbal behavioral cues is the first step of the process that actually deals with nonverbal behavior and it includes well developed domains like facial expression recognition (Zeng et al., 2009), prosody extraction (Crystal, 1969), gesture and posture recognition (Mitra & Acharya, 2007), head pose estimation (Murphy-Chutorian & Trivedi, 2009), laughter detection (Truong & Van Leeuwen, 2007), etc. (see Vinciarelli et al., 2009, for an extensive survey of techniques applied at all processing steps). These are the perceivable stimuli that we both produce and sense in our everyday interactions to communicate with others.
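As a concrete illustration of one of these technologies, prosody extraction, the sketch below computes a few basic descriptors (pitch statistics, energy, voicing ratio) with the librosa library. The choice of library, pitch range and descriptors is an assumption made for the example rather than a standard recipe.

```python
# Sketch of prosody extraction for one speaker's audio segment.
import numpy as np
import librosa

def prosodic_features(wav_path):
    """Return simple prosodic descriptors of 'how' something is said."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch track
    energy = librosa.feature.rms(y=y)[0]                           # frame energy
    return {
        "pitch_mean": float(np.nanmean(f0)),     # average fundamental frequency
        "pitch_std": float(np.nanstd(f0)),       # pitch variability
        "energy_mean": float(energy.mean()),     # overall loudness
        "voiced_ratio": float(np.mean(voiced)),  # rough proxy for pausing behavior
    }
```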


Nonverbal behavior understanding: In the communication process, receivers draw information from perceivable stimuli. This information corresponds, in general, to what the emitter actually wants to convey, but this is not necessarily the case. Nonverbal behavior understanding corresponds to this step of the communication process and aims at inferring information like the emotional state or the relational attitude of the emitter from the nonverbal behavioral cues detected at the previous stage of the process. This step relies in general on machine learning and pattern recognition approaches, and it is the point where findings from the human sciences are integrated into technological approaches. Most of the effort has been dedicated to the recognition of emotions (Picard, 2000) and social signals, i.e. the relational attitudes exchanged by people in social interactions (Vinciarelli et al., 2009).
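A minimal sketch of this understanding step, assuming scikit-learn and a labeled corpus of feature vectors (for instance the prosodic descriptors computed above) paired with emotion or attitude labels, could look as follows; any other classifier could play the same role.

```python
# Sketch of the understanding step: map cue-level feature vectors to labels
# such as emotion classes or relational attitudes.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_state_recognizer(feature_vectors, labels):
    """Fit a classifier from behavioral feature vectors to affective labels."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    model.fit(feature_vectors, labels)
    return model

# Usage with hypothetical data:
# recognizer = train_state_recognizer(X_train, y_train)
# predictions = recognizer.predict(X_new)
```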

PSYCHOLOGY AND TECHNOLOGY OF FACE-TO-FACE INTERACTIONS

The most natural setting for nonverbal communication is face-to-face interaction, in particular conversations, which are considered the "primordial site of social interaction" (Schegloff, 1987). As such, conversations are the natural context for a wide spectrum of social phenomena that have a high impact on our life as well as on the life of the groups we belong to (Levine & Moreland, 1998), whether these are work teams expected to accomplish complex collaborative tasks, circles of friends trying to organize an entertaining Saturday evening, or families aimed at supporting the well-being of their members.

This section focuses in particular on those social phenomena that have not only been investigated from a psychological point of view, but have also been the subject of technological research (Vinciarelli et al., 2008, 2009).

Psychology of Face-to-Face Interactions

Three main social phenomena recognized as fundamental by psychologists have been addressed by computer scientists as well, namely roles, dominance and conflict (or disagreement). This section provides a description of each one of them.

Roles are a universal aspect of human-human interaction (Tischler, 1990): whenever people convene to interact, they play roles with the (often unconscious) goal of fulfilling others' expectations (if you are the head of a group you are expected to provide guidance towards the fulfillment of group goals), giving meaning to their behaviors (helping a patient as a doctor is a professional duty, while helping the same patient as a family member is a form of love and attachment), and providing predictability to other interactants (when teachers enter their classroom it is likely they will give a lecture, and this helps students to behave accordingly). Some roles correspond to explicit functions (like the examples given above) and can be easily identified and formalized, while others are more implicit and embody deeper aspects of human-human interaction, like the attacker, the defender or the gate-keeper in theories of human interactions (Bales, 1950). From a behavioral point of view, roles corresponding to explicit functions tend to induce more regular and detectable behavioral patterns than others and are thus easier to analyze automatically (Salamin et al., 2009).

Conflict and disagreement are among the most investigated social phenomena because their impact on the life of a group is significant and potentially disruptive. In some cases, conflicts foster innovation and enhance group performance, but in most cases they have the opposite effect and can lead to the dissolution of the group (Levine & Moreland, 1998). From a social point of view, the most salient aspects of conflict and disagreement are activities of some members that have negative effects on others, attempts to increase power shares at the expense of others, bargaining between members, and the formation of coalitions (Levine & Moreland, 1990). In terms of nonverbal behavior, conflicts are typically associated with interruptions, higher fidgeting, the voice loudness typical of anger, pragmatic preference structures such that people tend to react to those they disagree with rather than to those they agree with (Bilmes, 1988; Vinciarelli, 2009), longer periods of overlapping speech, etc.

Dominance accounts for the ability to influence others, control available resources, and have a higher impact on the life of a group, whatever its goal is. Dominance can be interpreted as a personality trait (the predisposition to dominate others) or as a description of the relationships between group members (Rienks & Heylen, 2006). While being a hypothetical construct (it cannot be observed directly), dominance gives rise to a number of nonverbal behavioral cues that allow observers to agree on who the dominant individual (or individuals) in a given group is. These include sitting in positions allowing direct observation of the others, like the shortest side of a rectangular table (Lott & Sommer, 1967), being looked at by others more than looking at others (Dovidio & Ellyson, 1982), talking longer than others (Mast, 2002), etc.
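The second of these cues can be turned into a simple quantitative indicator once gaze annotations (or automatic focus-of-attention estimates) are available. The event format and function below are assumptions made for illustration, not a standard measure.

```python
# Sketch of one dominance cue: being looked at more than looking at others.
from collections import defaultdict

def gaze_received_vs_given(gaze_events):
    """Return received/given gaze time per person as a crude dominance cue."""
    given = defaultdict(float)     # time each person spends looking at others
    received = defaultdict(float)  # time each person is looked at by others
    for looker, target, duration in gaze_events:
        given[looker] += duration
        received[target] += duration
    people = set(given) | set(received)
    return {p: received[p] / given[p] if given[p] > 0 else float("inf")
            for p in people}

# Example: gaze_received_vs_given([("A", "B", 4.0), ("B", "A", 1.5), ("C", "A", 2.0)])
```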

Technology of Face-to-Face Interactions

Given the centrality of small group interactions in psychology research, it is not surprising to observe that computing technology efforts aimed at the analysis of social and affective phenomena have focused on face-to-face interaction scenarios like meetings, talk-shows, job interviews, etc. (Vinciarelli et al., 2008, 2009). This section proposes a brief survey of the most important approaches dedicated to this problem in the literature, with particular attention to those dealing with role recognition, conflict and disagreement analysis, and dominance detection, i.e. those dealing with the social phenomena identified above as among the most important ones from a social psychology point of view. Table 2 reports results and some of the experimental characteristics of the works surveyed in this section.

Role recognition is typically based on the automatic analysis of speaking activity, the physical, machine detectable aspect of behavior that seems to be most correlated with the roles people play in a conversation. By speaking activity is meant here the simple act of speaking or remaining silent, the use of certain words rather than others, the tendency to speak while others are speaking, the number and length of turns during a conversation, etc. The temporal proximity of different speakers' interventions is used in (Vinciarelli, 2007; Salamin et al., 2009) to build social networks and represent each person with a feature vector. This is then fed to Bayesian classifiers mapping individuals into roles belonging to a predefined set. A similar approach is used in several other works (Barzilay et al., 2000; Liu, 2006; Garg et al., 2008; Favre et al., 2009) in combination with approaches for modeling lexical choices, like BoosTexter (Barzilay et al., 2000) or Support Vector Machines (Garg et al., 2008). Probabilistic sequential approaches, namely Maximum Entropy Classifiers and Hidden Markov Models, are applied to sequences of feature vectors extracted from individual conversation turns in (Liu, 2006) and (Favre et al., 2009), respectively. An approach based on C4.5 decision trees and empirical features (number of speaker changes, number of speakers talking in a given time interval, number of overlapping speech intervals, etc.) is proposed in (Banerjee & Rudnicky, 2004). A similar approach is proposed in (Laskowski et al., 2008), where the features are the probability of starting to speak when everybody is silent or when someone else is speaking, and role recognition is performed with a Bayesian classifier based on Gaussian distributions. The only approaches including features not related to speaking activity are presented in (Zancanaro et al., 2006; Dong et al., 2007), where fidgeting is used as evidence of role. However, the results seem to confirm that speaking activity features are more effective.
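The loose sketch below illustrates the general idea of speaking-activity features fed to a Bayesian classifier: each speaker is described by how their turns are distributed next to the other speakers' turns. It is an illustration inspired by these works, not a reproduction of any published method, and the turn sequences, speaker set and role labels are hypothetical.

```python
# Sketch: who-talks-after-whom features per speaker, plus a Bayesian classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def adjacency_features(turn_sequence, speakers):
    """One normalized row per speaker: how often they talk right after each other speaker."""
    index = {s: i for i, s in enumerate(speakers)}
    counts = np.zeros((len(speakers), len(speakers)))
    for prev, curr in zip(turn_sequence, turn_sequence[1:]):
        if prev != curr:
            counts[index[curr], index[prev]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)       # avoid division by zero

# Hypothetical training data: turn sequences with known role labels.
# X = np.vstack([adjacency_features(seq, speakers) for seq in train_sequences])
# y = [...]  # one role label per speaker and per conversation
# role_classifier = GaussianNB().fit(X, y)
```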

Conflict and disagreement analysis is a domain that has attracted increasing interest in recent years (Bousmalis et al., 2009). As in the case of roles, behavioral evidence based on speaking activity seems to account reliably for conflict, agreement and disagreement, though psychology insists on the importance of facial expressions, head nods and bodily movements (Poggi, 2007). The coalitions forming during television debates are reconstructed in (Vinciarelli, 2009) through a Markov model taking into account that people tend to react to someone they disagree with more than to someone they agree with. Similarly, pairs of talk spurts (short turns) are first modeled in terms of lexical (which words are uttered),
