Predicting interviewee attitude and body language from speech descriptors
Yosef Solewicz (1), Chagay Orenshtein (2), and Avital Friedland
(1) National Police, Israel
(2) Tel Hai College, Israel
Abstract

The present research investigated the relationship between personal impressions and the acoustic nonverbal communication conveyed by employees being interviewed. First, we investigated the extent to which different conversation topics addressed during the interview induced changes in the interviewees' acoustic parameters. Next, we attempted to predict the observed and self-assessed attitudes and body language of the interviewees based on the acoustic data. The results showed that topicality caused significant deviations in the acoustic parameter statistics, but the ability to predict the personal perceptions of the interviewees based on their acoustic nonverbal communication was relatively independent of topicality, due to the natural redundancy inherent in acoustic attributes. Our findings suggest that joint modeling of speech and visual cues may improve the assessment of interviewee profiles.

Keywords: speech, body language, human-computer interaction, tracking/perception, gesture analysis, nonverbal communication, body gestures (multi-modality), e-interviews, conversation topics

1. Introduction

1.1 The study rationale

Human affect sensing can be achieved using a broad range of behavioral cues and signals that are available through diverse channels. Affective states can be recognized based on visible signals, such as gestures (facial expressions, body gestures, head movements, and the like), speech (parameters such as pitch, energy, frequency, and duration), or covert physiological signals (respiration, cardiac activity, and electrodermal activity). According to cognitive neuroscience research, information coming from various modalities is combined in our brains to yield multi-modally determined percepts (Driver & Spence, 2000). In real-life situations, our different senses receive correlated information about the same external event. This redundancy can be helpful when some of the channels that convey signals are unavailable, such as in a telephone conversation, when there is no visual feedback from the interlocutor, or in order to enhance speech perception when the audio is corrupted by noise. Furthermore, the multiple signals make it possible for a person assessing someone else's emotional or affective state to consider significantly variable conditions and select alternative channels from the multiple input modalities in order to grasp the emotions being transmitted (Gunes, Piccardi, & Pantic, 2008).

However, what happens if the different information channels send different messages to an observer? People seem to be able to differentiate between honest and untrustworthy channels. In fact, when verbal and nonverbal speech signals contradict each other, people generally trust the latter more, since it unconsciously broadcasts one's true feelings (Ambady & Rosenthal, 1992).

The present research focused on the relationship between visual and acoustic information channels. During interpersonal communication, speech and body gestures coordinate the encoding of nonverbal intents in order to convey underlying internal emotion states (Condon, 1976; Kendon, 1980; McNeill, 1985, 1996).
Research has shown that as many as 90% of body gestures are associated with speech, not only regulating the interactions and punctuating discourse, or representing typical social signals, but also emphasizing the speakers' thoughts as they occur (see, e.g., McNeill, 1996). These channels are connected at both the behavioral and neural levels (Healey & Braun, 2013). In fact, the relationship between acoustic features and gestures has been the subject of extensive research and some controversy. Although it is widely agreed that gesture and speech reflect the same cognitive process, some researchers have claimed that they are independent and parallel processes (e.g., Krauss, 1998). According to this approach, gestures are seen as an auxiliary channel that supports speech.

In a seminal study, Scherer (2003) presented a theoretical model of the vocal communication of emotion and reviewed the acoustic correlates of different emotional patterns. Cowie et al. (2001) and Juslin and Scherer (2005) provided comprehensive overviews of previous research in the field, and Narayanan and Georgiou (2013) reviewed computational techniques for modeling human behavior based on speech. It is known that emotions can also be visually inferred from gestures, but the mechanisms by which this occurs seem to be unclear (Coulson, 2004). Nevertheless, the combination of speech and visual information has been shown to improve behavior assessment (Busso & Narayanan, 2007; Gatica-Perez, Vinciarelli, & Odobez, 2014; Pantic & Vinciarelli, 2014; Valstar et al., 2013; Yang & Narayanan, 2014).

Unlike the extensive research on speech-emotion mapping published to date, in the present study we did not explicitly address specific emotions or focus on their categorization based on acoustic parameters. Instead, the purpose of the present research was to investigate the general ability to model perceived body language and other expressive intents by examining the speaker's acoustic nonverbal communication.

This article follows the sequence of the research process. It begins with an analysis of the speech parameters measured in the course of the interviews and a demonstration of the significant statistical dissimilarities among the parameters extracted from different interview sessions. This is followed by the presentation of models for the prediction of visual intents conveyed by interviewees based on the acoustic parameters, using different interview sessions for training and testing. Finally, we report on the results regarding the robustness of the models, which was tested using data from different sessions.

1.2 Acoustic analysis

Vocal nonverbal behavior includes five major components: voice acoustics, linguistic vocalizations, non-linguistic vocalizations, silences, and turn-taking patterns. Each component refers to different social signals that contribute to different aspects of the social perception of a message. In the present research, the acoustic analysis procedure was based entirely on two of these components: voice acoustics and silences. We estimated the acoustic parameters based on stressed vowels only, because they are significantly affected by expressive speech and, in addition, these segments usually possess a high signal-to-noise ratio. In examining the silence excerpts, we focused on discourse pauses, thus considering only relatively long silent segments.
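To make this segment-selection step concrete, the following is a minimal sketch in Python. The `segments` list of (label, start, end) tuples is a hypothetical output of a phoneme recognizer; the 400 ms pause threshold and the "longest half" rule are taken from the Methodology section below, while the label inventory is an illustrative assumption rather than any recognizer's actual one.

```python
# Minimal sketch of stressed-vowel and discourse-pause selection.
# `segments` is a hypothetical list of (label, start_s, end_s) tuples;
# the vowel label set and the "pau" non-voice label are assumptions.
VOWEL_LABELS = {"a", "e", "i", "o", "u"}

def select_segments(segments, pause_min_s=0.4):
    """Return the 'stressed vowels' (longest half of all detected vowels)
    and discourse pauses (non-voice stretches longer than pause_min_s)."""
    vowels = [s for s in segments if s[0] in VOWEL_LABELS]
    pauses = [s for s in segments
              if s[0] == "pau" and (s[2] - s[1]) > pause_min_s]
    # order the vowels by duration and keep the longest half
    vowels.sort(key=lambda s: s[2] - s[1], reverse=True)
    stressed = vowels[: len(vowels) // 2]
    return stressed, pauses
```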
Voice acoustics is a general term, which can be further refined according to the voice production model (for a classical text on this theme, see Rabiner & Schafer, 1978). According to this model, which is often based on the source-filter theory (Fant, 1960), the speech signal is the result of filtering an excitation source. The excitation signal, which is due to airflow from the lungs, passes through the vocal cavity and is shaped into different sounds. For the sake of simplicity, it is assumed that excitation and filter are decoupled, although this is not entirely true. The excitation signal can be roughly classified as voiced or unvoiced. The former is formed by periodic impulses of air modulated by the vocal cords (pitch); in the latter, the excitation is aperiodic.

Based on this model, acoustic features are categorized into three main groups, representing distinct levels within the speech model. The prosody features are those linked to the excitation source at the macro level; they define the intonation and rhythm of speech. At the micro level, the dynamics of the excitation signal define voice-quality characteristics. Finally, spectral (including cepstral) features result from idiosyncrasies of the vocal-tract filter. The features used in the present study and their classification into the three groups, prosodic (P), voice quality (Q), and spectral/cepstral (S), are presented in Table 1.

Table 1. Speech Features Used in the Experiments

Feature abbreviation | Type | Feature description
Spkrate | P | Total vowel length to total speech length ratio
Mean pause | P | Mean length of pause segments
Pauses second | P | Average number of pauses per second
Pause speech ratio | P | Total pause length to total speech length ratio
Rhythm | P | Average number of vowels per second
Vowel mean | P | Average length of vowels
Vowel std | P | Standard deviation of vowel lengths
Intensity std | P | Standard deviation of vowel intensity values
F0 std | P | F0 standard deviation
F0 mean | P | F0 mean
Vowel F0 range | P | Average vowel F0 range
Harmonicity | Q | Harmonic-to-noise ratio mean
Jitter loc | Q | Mean of local jitter
Jitter ppq5 | Q | Mean of five-point period perturbation quotient jitter
Shimmer loc | Q | Mean of local shimmer
Shimmer apq5 | Q | Mean of five-point amplitude perturbation quotient shimmer
F1 | S | Mean of first formant frequency
F2 | S | Mean of second formant frequency
F3 | S | Mean of third formant frequency
B1 | S | Mean of first formant bandwidth
B2 | S | Mean of second formant bandwidth
B3 | S | Mean of third formant bandwidth
Cep1 | S | Mean of first mel-frequency cepstral coefficient (MFCC)
Cep2 | S | Mean of second MFCC
Cep3 | S | Mean of third MFCC
Cep4 | S | Mean of fourth MFCC
Cep5 | S | Mean of fifth MFCC
Cep6 | S | Mean of sixth MFCC
Cep7 | S | Mean of seventh MFCC
Cep8 | S | Mean of eighth MFCC
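As an illustration of how features of the kind listed in Table 1 can be extracted, the following sketch uses parselmouth, a Python interface to PRAAT (the Methodology section reports using PRAAT itself, so the library choice is our assumption). The file name is hypothetical, and the Praat command arguments shown are the commonly used defaults, not necessarily the authors' exact settings.

```python
# Sketch: computing a few Table 1 features with parselmouth (Python/PRAAT).
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("interview_S1.wav")       # hypothetical file

# Prosodic: F0 mean and standard deviation over voiced frames
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                   # drop unvoiced frames
f0_mean, f0_std = f0.mean(), f0.std()

# Quality: harmonicity (HNR), jitter, and shimmer
harmonicity = snd.to_harmonicity()
hnr_mean = call(harmonicity, "Get mean", 0, 0)

point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
jitter_loc  = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
jitter_ppq5 = call(point_process, "Get jitter (ppq5)",  0, 0, 0.0001, 0.02, 1.3)
shimmer_loc  = call([snd, point_process], "Get shimmer (local)",
                    0, 0, 0.0001, 0.02, 1.3, 1.6)
shimmer_apq5 = call([snd, point_process], "Get shimmer (apq5)",
                    0, 0, 0.0001, 0.02, 1.3, 1.6)

# Spectral: mean first-formant frequency
formants = snd.to_formant_burg()
f1_mean = call(formants, "Get mean", 1, 0, 0, "hertz")
```

In the study these measurements were taken over the 80 ms (prosodic/quality) and 40 ms (spectral/cepstral) windows around the stressed-vowel centers, rather than over the whole file as in this sketch.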
2. Methodology

The corpus used in this research was formed by a series of recorded interviews in Hebrew with a group of female employees (mean age = 45 years). The interviewees were staff members at daycare centers for infants of low-income families. All the procedures performed in the study that involved human participants were in compliance with the ethical standards of the institutional research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all the individual participants included in the study.

Two research assistants conducted and recorded the interviews (one in each interview), in secluded rooms at the workplaces. At the end of each interview, the research assistant and the interviewee each completed a questionnaire regarding several aspects of frontal body language, reactions, and their general impression of the interviewee. Both the research assistant and the interviewee ranked each attribute on a 7-point scale. The same type of digital recorder and recording setup were used in all interviews, in order to avoid external distortion of the acoustic features.

During the interview, the employees were given one minute to talk without interruption about a specific theme, in three successive sessions. The aim was to induce a different conversation topic in each session of the interview. In the first session, the employee was asked to describe herself in general. In the next session, she was asked to describe a typical workday. In the third and last session, she was asked how she would react to specific hypothetical dangerous situations involving the children with whom she worked. After filtering out poor recordings, we obtained 297 one-minute recordings (99 speakers x 3 conversation topics). The speech parameters were computed separately for each of these recordings. At this stage, questionnaires that were not properly annotated were also discarded. In the end, we obtained speech data from 69 complete interviews; these served as the basis of the analysis of the body-language attributes defined in the questionnaire.

For the purpose of speech parametrization, the recorded speech files were first converted from mp3 to wav format and then downsampled to 11025 Hz. A phoneme recognizer (Schwarz, Matejka, & Cernocky, 2006) was used to automatically segment each speech file. The vowels detected among the recognized phonemes were ordered by length, and the longest half of the ordered vowels formed the "stressed vowels" set. A window of 80 ms around the vowel centers was used for prosodic and quality parameter estimation, and a shorter window of 40 ms for spectral/cepstral parameter estimation. The PRAAT software program (Boersma, 2001) was used to calculate the acoustic features. With regard to discourse pauses, non-voice excerpts longer than 400 ms were considered "pauses" and used to derive the prosodic parameters with temporal characteristics.

As noted earlier, our experiments were divided into two main units: the relationships between several speech parameters and the defined topics, and the prediction of body-language and attitude descriptors based on the speech parameters.

3. Experiments

3.1 The relationships between speech parameters and conversation topics

3.1.1 Statistical analysis

The initial objective was to determine whether the distinct speech parameters differed significantly by conversation topic. For this purpose, we employed both the paired t-test (Goulden, 1956) and the Wilcoxon signed-rank test (Wilcoxon, 1945). The paired t-test determines whether two paired sets differ from each other significantly, based on the assumption that the paired differences are independent and identically normally distributed. The Wilcoxon test is the non-parametric analogue of the paired t-test and should be used if the distribution of differences between pairs is non-normal. In our case, we assessed the differences between speakers in the set of speech features, for pairs of topics. We employed both tests because some of the features (most notably the temporal features) did not follow a typical Gaussian distribution.
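A minimal sketch of this per-feature testing, assuming hypothetical arrays `topic_a` and `topic_b` of shape (n_speakers, n_features) that hold one value of each speech feature per speaker under each topic:

```python
# Sketch: paired t-test / Wilcoxon signed-rank test per feature, encoding
# each outcome as +1 (significant increase in the mean), -1 (significant
# decrease), or 0, as in Table 2 and the shift vectors of Section 3.1.3.
import numpy as np
from scipy import stats

def shift_vector(topic_a, topic_b, test=stats.wilcoxon, alpha=0.05):
    """topic_a, topic_b: (n_speakers, n_features) arrays of paired values."""
    marks = []
    for i in range(topic_a.shape[1]):
        a, b = topic_a[:, i], topic_b[:, i]
        p = test(a, b).pvalue
        marks.append(int(np.sign(b.mean() - a.mean())) if p < alpha else 0)
    return np.array(marks)

# e.g. v12 = shift_vector(X_topic1, X_topic2)                    # Wilcoxon
#      t12 = shift_vector(X_topic1, X_topic2, test=stats.ttest_rel)
```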
In fact, except in the case of a few features, both tests yielded the same results. Table 2 shows the test results regarding differences among Topics 1, 2, and 3. For example, t 1→2 denotes the outcome of the t-test, and W 1→2 that of the Wilcoxon test, for a specific feature in the passage from Topic 1 to Topic 2. An upward arrow denotes a positive change in the mean of the feature after moving on to the next topic, and a downward arrow denotes a negative change; blanks reflect no significant change.

Table 2. Changes in Speech Parameters in the Transitions Between Conversation Topics (N = 99)

Feature | Test outcomes (in the order t 1→2, t 1→3, t 2→3, W 1→2, W 1→3, W 2→3)
Spkrate | ↑ ↑ ↑ ↑
Mean pause | ↓* ↓ ↓ ↑
Pauses second | ↓ ↓ ↓ ↓ ↓ ↓
Pause speech ratio | ↓ ↓ ↓ ↓
Rhythm | ↑ ↑ ↑ ↑ ↑ ↑
Vowel mean | ↓ ↓ ↓ ↓ ↓ ↓
Vowel std | ↓* ↓*
Intensity std | ↓* ↑ ↓* ↑
F0 std | ↑* ↑ ↑* ↑
F0 mean | ↑ ↑ ↑ ↑ ↑ ↑
Vowel F0 range | ↓* ↑* ↑ ↓* ↑* ↑
Harmonicity | ↑ ↓ ↑ ↓
Jitter loc | ↓ ↑* ↓ ↑
Jitter ppq5 | ↓* ↓
Shimmer loc | ↓ ↓* ↑* ↓ ↓* ↑*
Shimmer apq5 | ↓ ↑* ↓* ↑*
F1 | ↑ ↑ ↑ ↑ ↑ ↑
F2 | ↓ ↓ ↓ ↓
F3 | ↑ ↓* ↑
B1 | ↓ ↓* ↓ ↓*
B2 | ↓*
B3 |
Cep1 | ↓ ↓ ↓ ↓
Cep2 | ↓ ↓ ↓ ↓ ↓ ↓
Cep3 | ↓ ↓ ↓ ↓ ↓ ↓
Cep4 |
Cep5 | ↑*
Cep6 | ↓ ↓ ↓ ↓
Cep7 | ↑ ↑* ↑
Cep8 |

* denotes significance at a level of .05.

3.1.2 Acoustic correlates

After the objective statistical analysis of the changes in the different speech features across the three induced conversation topics, we proceeded to investigate possible correlations between these attributes and the topics of the different sessions, both quantitatively and qualitatively. As noted, during Session 1 the speakers had more freedom regarding the choice of topic; in Session 2, they were directed to focus on work issues; and the third session focused on uncomfortable themes, with the intention of causing the interviewees some degree of stress.

Unlike the majority of previous studies, in the present research we did not direct the interviewees to discuss well-defined topics in order to assess the corresponding speech parameters. Instead, we compared the relative changes in the speech parameters measured across the different topics with those reported in other studies. It should be noted that caution should be exercised in interpreting acoustic correlates of conversation topics, especially in the case of different experiment designs. Different instantiations or variants of specific emotions, even though collectively labeled by the same tag, may be represented by differing acoustic expression patterns (Scherer, 2003).

As Table 2 shows, most of the acoustic parameters did reflect significant differences between the different conversation topics. Taking the interview regarding Topic 1 as the baseline, we investigated the progressive changes in the means of the parameters in Sessions 2 and 3. In general, the spectral/cepstral and most of the prosodic features shifted towards both topics in the same direction, either positive or negative; this suggests that Topics 2 and 3 were quite similar, and the respective parameters differed from the baseline mostly in the intensity of change. In contrast, the directions of change in the quality descriptors were not consistent among the topics, suggesting that these parameters are more sensitive in capturing qualitative nuances between Topics 2 and 3.
3.1.3 Quantitative analysis

To obtain quantitative insights regarding these results, we performed a simplified mathematical analysis. The general acoustic parameter shift from any Topic a to Topic b is represented as a vector v. The element i of this vector, v_i, is the result of a statistical significance test for the i-th feature in the shift from Topic a to Topic b. For simplicity, v_i can be set at one of three values: 1 if the null hypothesis H_0 (the same statistical distribution for feature i in Topics a and b) is rejected and the mean of the feature increases after the move; -1 if H_0 is rejected and the mean decreases; and 0 otherwise (Equation 1):

    v_i = +1, if H_0 is rejected for feature i and mu_b > mu_a
    v_i = -1, if H_0 is rejected for feature i and mu_b < mu_a        (1)
    v_i =  0, otherwise

Vector v roughly represents the parameter shift and, therefore, a quantification of the overall transition from one conversation topic to another. We can therefore estimate the similarity between two topic transitions by means of vector metrics, such as the Euclidean distance or cosine similarity. This is exemplified by estimating the v vectors for the transitions among Topics 1, 2, and 3, and calculating the cosine similarity between pairs of topic transitions. Note that a cosine similarity between two transitions close to 1 is an indication that the overall acoustic shifts are similar; values close to -1 reflect a negative correlation; and values close to 0 are a sign of uncorrelated changes in the acoustic features.

The numerical results of the scheme described are presented in Table 3. The vectors were estimated using the more general Wilcoxon test, but the t-test yielded similar results as well. Table 3 depicts the cosine similarities among the topic passages for two significance levels, alpha. It clearly shows that the acoustic changes when the speaker passed from the baseline topic (Topic 1) to either Topic 2 or Topic 3 were generally similar. This is expressed by a high cos(v 1→2, v 1→3) value. On the other hand, the low value found for cos(v 1→2, v 2→3) suggests that the acoustic changes involved in the transition from Topic 2 to Topic 3 differed from those in the transition from Topic 1 to Topic 2. (Note that since the cosine similarity is not a formal distance metric, it does not have the triangle inequality property.)

Table 3. Cosine Similarity Between Distinct Conversation Topic Transitions (N = 99)

Measurement | Similarity (alpha = 0.01) | Similarity (alpha = 0.05)
cos(v 1→2, v 1→3) | .71 | .58
cos(v 1→2, v 2→3) | .26 | .00
cos(v 1→3, v 2→3) | .53 | .54

However, this process for quantifying similarities between topic transitions lacks further mathematical formality. In particular, it does not consider correlation effects among the different vector components. In future research, a more refined analysis should be conducted on a decorrelated projection of the vector space.
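A short sketch of this transition-similarity computation, reusing shift vectors of the form produced by the earlier sketch (the three arrays here are illustrative, not the study's values):

```python
# Sketch: cosine similarity between topic-transition shift vectors (Eq. 1).
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# illustrative five-feature shift vectors, not the paper's data
v12 = np.array([1, -1, -1, 1, 0])    # Topic 1 -> 2
v13 = np.array([1, -1, -1, 1, 1])    # Topic 1 -> 3
v23 = np.array([0,  1, -1, 0, 1])    # Topic 2 -> 3

print(cosine(v12, v13))   # expected high: shifts from the baseline are similar
print(cosine(v12, v23))   # expected low, as in Table 3
```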
3.1.4 Qualitative analysis

We also made a brief attempt to identify traces of emotional speech regarding the different topics. It should be noted that in this research, the changes in speech topicality were not originally meant to lead to specific emotional speech topics. Nevertheless, we found some correlations between the acoustic parameters recorded in the present experiments and the emotional speech patterns reported in the literature (Drioli, Tisato, Cosi, & Tesser, 2003; Grawunder & Winter, 2010; Juslin & Laukka, 2001; Nunes, Coimbra, & Teixeira, 2010; Patel, Scherer, Sundberg, & Bjorkner, 2011; Scherer, 2003; Sobin & Alpert, 1999; Yildirim et al., 2004). It is well accepted that higher-order spectral parameters are generally less sensitive to emotional speech. This was also observed in the present research, given the absence of statistically significant differences in these parameters among the different topics.

On the other hand, mean duration and its spread have been found to increase in emotional speech, which was not unequivocally supported in our experiments. F0 mean and spread are expected to increase in positive emotional speech, and in fact our results supported this, with the exception of the F0 range of vowels, which decreased in the 1→2 passage. This downward pitch inflection might be associated with traces of disgust. Regarding intensity, its spread generally increases in emotional speech (except for sadness). In our experiment, this trend was observed in the passage from Topic 2 to 3, but not in the passage from Topic 1 to 2. Speech rate (including rhythm) usually increases for positive and decreases for negative emotional speech. Our measurements indicated an increase in both positive and negative speech-rate parameters (except in the 1→2 passage). According to previous studies, F2 tends to increase in emotional speech, and F1 variations depend on the type of emotion. Interestingly, our results showed a decrease in F2 and an increase in F1. Finally, voice-quality parameters have been found to be an important aspect of emotional speech. Jitter and shimmer seem to be somewhat negatively correlated with harmonicity across distinct emotions, and our results support this general trend. Jitter, shimmer, and pitch variability usually decrease in polite speech. In the present research, this was observed in the passage from Session 1 to 2, but the opposite was found in the passage from Session 2 to 3.

In summary, as expected, the present findings did not indicate clear emotional patterns characterizing the changes between the different interview sessions. Broadly speaking, the 1→2 passage seemed to be emotionally opaque. A possible explanation could be that the task of describing one's work is not a particularly exciting theme. In comparison, more traces of emotion of the type described in the literature were found when the speakers moved to Session 3; this might have been anticipated, considering its sensitive theme. These findings roughly support those of our quantitative analysis, which indicated a more dramatic difference in speech parameters when the speakers moved from Topic 2 to 3 than from Topic 1 to 2.

3.2 Predicting body language and attitude from speech parameters

The second part of our research focused on the ability to model the body language and attitude of interviewees based on assessment of their speech parameters. As reported earlier, the distinct recording sessions within the interviews led to statistically significant differences between the different conversation topics. However, in these experiments we assumed that the attitudes of the interviewees during the three sessions were not strongly dependent on the specific session and were represented by the general perceived (by the research assistants) and self-assessed (by the interviewees) impressions reported at the end of each interview. In other words, we considered the speech parameters measured for the different topics as independent variables, potentially explaining or predicting the interviewee's behavioral traits, the dependent variables. Accordingly, we built separate prediction models for the dependent variables using the measurements of the independent variables from each interview session.
The results show that a model trained on data from a specific session can be applied to independent data obtained in other interview sessions; in other words, topicality and conversation topic are relatively irrelevant when training the prediction models.

A stepwise linear regression (Draper & Smith, 1998) was employed to create the prediction models. This is an iterative technique for selecting the most statistically significant independent terms to fit a multilinear model for the prediction of a dependent variable. One of its limitations is that global optimization is not guaranteed, and different models can be selected under different initial conditions or step sequences. This technique may also suffer from overfitting, which reduces the applicability of models to other datasets. Symptoms of overfitting may be difficult to identify, due to the high rate of correlation among the speech parameters (multicollinearity). A prior decorrelation of the feature space could reduce these problems and should be considered in the future.

We built an overall prediction model for each of the dependent variables, but attempted to discard spurious models that could lead to misleading descriptive analyses by means of leave-one-out cross-validation (LOOCV). Specifically, different models were iteratively created for each dependent variable, using the left-out sample for testing and the rest of the data for training. We arbitrarily declared a model stable if:

1. At least 75% of the models selected during the LOOCV folds were identical (same independent variables selected) to the overall model (trained using the whole data set).
2. The ratio between the sample correlation coefficient attained through the LOOCV process and that obtained by the overall model was greater than 0.75.
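This procedure can be sketched as follows, under stated assumptions: a simple forward-only stepwise selection (the authors' stepwise procedure may also drop terms) over hypothetical arrays `X` (n_interviews x n_features of speech parameters) and `y` (one questionnaire rating per interview), followed by the two stability checks described above.

```python
# Sketch: forward stepwise linear regression with a LOOCV stability check.
import numpy as np
from scipy import stats

def fit(X, y, cols):
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, beta

def forward_stepwise(X, y, alpha=0.05):
    """Greedily add the predictor whose coefficient is most significant
    (two-sided t-test); stop when none reaches `alpha`."""
    selected = []
    while True:
        best_p, best_j = alpha, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            A, beta = fit(X, y, selected + [j])
            dof = len(y) - A.shape[1]
            resid = y - A @ beta
            cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
            t = beta[-1] / np.sqrt(cov[-1, -1])
            p = 2 * stats.t.sf(abs(t), dof)
            if p < best_p:
                best_p, best_j = p, j
        if best_j is None:
            return selected
        selected.append(best_j)

def is_stable(X, y):
    overall = forward_stepwise(X, y)
    A, beta = fit(X, y, overall)
    r_overall = stats.pearsonr(A @ beta, y)[0]
    identical, preds = 0, np.empty(len(y))
    for i in range(len(y)):                       # LOOCV folds
        m = np.arange(len(y)) != i
        sel = forward_stepwise(X[m], y[m])
        identical += set(sel) == set(overall)
        _, beta = fit(X[m], y[m], sel)
        preds[i] = np.concatenate(([1.0], X[i, sel])) @ beta
    r_loocv = stats.pearsonr(preds, y)[0]
    # stability criteria from the text: >= 75% identical models
    # and an r ratio greater than 0.75
    return identical / len(y) >= 0.75 and r_loocv / r_overall > 0.75
```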
The following train-test pairs of tables summarize the results obtained for the dependent variables that could be successfully predicted (by means of stable models). The tables present the predicted dependent variable (DV), either as perceived by the interviewer (P) or self-assessed (SA) by the interviewee; the session (S1, S2, or S3) whose independent variables were used to train or test the regression models; the selected independent variables (predictors) with their corresponding regression coefficients and correlation signs (positive/negative); and the regression correlation coefficient, R (predictors of the same type are grouped together, separated by semicolons, for convenience). Note that we did not perform regression training using SA variables, since they were unavailable for several of the attitude labels.

Table 4a. Train Mode for DV "Cooperative" (N = 69)

Training session | Correlations/predictors | DV type | R
S1 | -.67 Pause-speech ratio, -.35 Mean pause; +.29 Cep1 | P | .81
S2 | -.73 Pause-speech ratio; +.26 B2 | P | .66
S3 | -.78 Pause-speech ratio; +.26 Cep6 | P | .70

Table 4b. Test Mode for DV "Cooperative" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .59
S1 | S3 | P | .65
S1 | S1 | SA | .53
S1 | S2 | SA | .35
S1 | S3 | SA | .36
S2 | S1 | P | .74
S2 | S3 | P | .65
S2 | S1 | SA | .51
S2 | S2 | SA | .44
S2 | S3 | SA | .32
S3 | S1 | P | .73
S3 | S2 | P | .51
S3 | S1 | SA | .37
S3 | S2 | SA | .21
S3 | S3 | SA | .24

Table 5a. Train Mode for DV "Proposed a practical solution" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | -.77 Pause-speech ratio, -.32 Intensity std, -.28 Mean pitch | P | .63
S2 | -.71 Pause-speech ratio; +.37 B2 | P | .60
S3 | -.77 Pause-speech ratio, -.29 Intensity std, +.32 Vowel F0 range; +.36 Cep6 | P | .70

Table 5b. Test Mode for DV "Proposed a practical solution" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .56
S1 | S3 | P | .60
S1 | S1 | SA | .48
S1 | S2 | SA | .23
S1 | S3 | SA | .35
S2 | S1 | P | .54
S2 | S3 | P | .57
S2 | S1 | SA | .50
S2 | S2 | SA | .31
S2 | S3 | SA | .36
S3 | S1 | P | .54
S3 | S2 | P | .51
S3 | S1 | SA | .50
S3 | S2 | SA | .23
S3 | S3 | SA | .38

Table 6a. Train Mode for DV "Serene" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | -.83 Pause-speech ratio; +.43 Cep1, +.60 Cep4 | P | .57
S2 | -.71 Pause-speech ratio | P | .42
S3 | -.95 Pause-speech ratio; -.54 Shimmer apq5; +.67 Cep4, +.36 Cep6 | P | .70

Table 6b. Test Mode for DV "Serene" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .41
S1 | S3 | P | .54
S1 | S1 | SA | .27
S1 | S2 | SA | .15
S1 | S3 | SA | .28
S2 | S1 | P | .43
S2 | S3 | P | .42
S2 | S1 | SA | .44
S2 | S2 | SA | .28
S2 | S3 | SA | .35
S3 | S1 | P | .46
S3 | S2 | P | .40
S3 | S1 | SA | .32
S3 | S2 | SA | .09
S3 | S3 | SA | .24

Table 7a. Train Mode for DV "Hesitant" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | +.96 Pause-speech ratio, +.40 Vowel std; -.56 Cep4 | P | .70
S2 | +.65 Pause-speech ratio, +.35 Rhythm; -.53 Cep4 | P | .57
S3 | +.73 Pause-speech ratio, +.31 Spkrate; -.53 Cep4 | P | .60

Table 7b. Test Mode for DV "Hesitant" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .49
S1 | S3 | P | .55
S1 | S1 | SA | .23
S1 | S2 | SA | .14
S1 | S3 | SA | .15
S2 | S1 | P | .61
S2 | S3 | P | .57
S2 | S1 | SA | .23
S2 | S2 | SA | .22
S2 | S3 | SA | .17
S3 | S1 | P | .63
S3 | S2 | P | .48
S3 | S1 | SA | .21
S3 | S2 | SA | .16
S3 | S3 | SA | .16

Table 8a. Train Mode for DV "Determined" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | -.96 Pause-speech ratio; +.47 Cep1, +.40 Cep4 | P | .66
S2 | -.81 Pause-speech ratio | P | .54
S3 | -.78 Pause-speech ratio; +.45 Cep6 | P | .55

Table 8b. Test Mode for DV "Determined" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .54
S1 | S3 | P | .54
S1 | S1 | SA | .38
S1 | S2 | SA | .22
S1 | S3 | SA | .38
S2 | S1 | P | .56
S2 | S3 | P | .46
S2 | S1 | SA | .44
S2 | S2 | SA | .33
S2 | S3 | SA | .41
S3 | S1 | P | .50
S3 | S2 | P | .41
S3 | S1 | SA | .31
S3 | S2 | SA | .17
S3 | S3 | SA | .37

Table 9a. Train Mode for DV "Answered properly" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | -.69 Pause-speech ratio, -.30 Intensity std; +.30 Jitter ppq5; +.37 F2, +.29 Cep1 | P | .64
S2 | -.59 Pause-speech ratio; +.44 B2 | P | .59
S3 | -.61 Pause-speech ratio, -.26 Intensity std, +.30 Vowel F0 range; +.41 Cep6 | P | .65

Table 9b. Test Mode for DV "Answered properly" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .50
S1 | S3 | P | .57
S1 | S1 | SA | .23
S1 | S2 | SA | .16
S1 | S3 | SA | .08
S2 | S1 | P | .48
S2 | S3 | P | .49
S2 | S1 | SA | .18
S2 | S2 | SA | .09
S2 | S3 | SA | -.04
S3 | S1 | P | .47
S3 | S2 | P | .43
S3 | S1 | SA | .18
S3 | S2 | SA | .20
S3 | S3 | SA | .06

Table 10a. Train Mode for DV "Tremulous" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | +.19 Pause-speech ratio; -.26 Cep1, -.40 Cep4, +.22 Cep7 | P | .59
S2 | +.37 Pause-speech ratio; -.25 B2, +.15 Cep2, -.36 Cep4 | P | .71
S3 | -.34 Mean pause, -.26 Pauses second, +.66 Pause-speech ratio; +.32 Shimmer apq5; -.18 Cep4, -.17 Cep6 | P | .67

Table 10b. Test Mode for DV "Tremulous" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .53
S1 | S3 | P | .40
S2 | S1 | P | .36
S2 | S3 | P | .42
S3 | S1 | P | .22
S3 | S2 | P | .58

Table 11a. Train Mode for DV "Turned face aside" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | -.41 Jitter loc; -.40 Cep1, +.44 Cep7 | P | .47
S2 | +.43 Pause-speech ratio; -.56 Cep1, +.42 Cep7 | P | .50
S3 | -.39 Vowel F0 range; -.37 B3, -.56 Cep6, +.64 Cep7 | P | .60

Table 11b. Test Mode for DV "Turned face aside" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .40
S1 | S3 | P | .43
S2 | S1 | P | .35
S2 | S3 | P | .21
S3 | S1 | P | .30
S3 | S2 | P | .26

Table 12a. Train Mode for DV "Breathed rapidly" (N = 69)

Trained on | Correlations/predictors | DV type | R
S1 | +.46 Mean pause; -.56 Cep4 | P | .65
S2 | -.31 Pauses second, +.65 Pause-speech ratio; +.22 F2, -.26 Cep4 | P | .71
S3 | -.34 Pauses second, +.74 Pause-speech ratio; +.44 Shimmer apq5; -.42 Cep4 | P | .74
Table 12b. Test Mode for DV "Breathed rapidly" (N = 69)

Trained on | Tested on | DV type | R
S1 | S2 | P | .55
S1 | S3 | P | .50
S2 | S1 | P | .58
S2 | S3 | P | .65
S3 | S1 | P | .58
S3 | S2 | P | .56

4. Discussion

This research investigated the ability to predict body language and behavioral traits based on speech descriptors recorded during interviews. The body-language and behavioral reactions used in the models were collected either as perceived by the two research assistants or as self-assessed opinions. The former option led to better modeling, which could be attributed to the more consistent ranking scale used by the assistants. Subjects were recorded in three distinct sessions, and in general, stable prediction models could be trained successfully in all three. Examples of less successful models were those attempting to predict eye/mouth/lip/hand/finger movements, posture, coughing, scratching, laughing, joyfulness, blushing, and stress.

One of the interesting findings of this research concerns the general freedom allowed regarding discourse topicality in the creation of the prediction models. As shown earlier, different interview sessions focusing on distinct topics differed significantly in terms of several speech descriptors and also led to different prediction models. Nevertheless, our results show that equally efficient models could be trained and further tested on speech parameters processed from different interview sessions. This further supports the idea that speech information generally represents a flexible, well-synchronized, and robust channel for decoding the visual intents conveyed by the interviewee.

Regarding the composition of the prediction models, a combination of prosodic, spectral, and voice-quality predictors was often found. According to the regressions, prosodic predictors, and in particular the pause-speech ratio, seemed to be the predominant ones; this feature also had the highest relative coefficient weight in the regression models. On the other hand, voice-quality features emerged as the least relevant. More specifically, all the models included a combination of the pause-speech ratio and some cepstral parameters. In most cases, positive reactions (cooperative, practical, serene, determined, answered properly) were characterized by a decrease in the pause-speech ratio (consistent with the findings of Baskett & Freedle, 1974, and Scherer, 1979, that long pauses and interlocutor latency induce the perception of incompetence), accompanied by an increase in cepstral parameters; the opposite was true with regard to negative reactions (hesitant, tremulous, turned face aside, breathed rapidly). (Note that Cep7 consistently displayed an opposite trend compared with the other cepstral coefficients; see Table 2.)

Following is a brief summary of other interesting prediction models obtained in the research. Hesitation was characterized by an increase in speech rate, rhythm, and variation in vowel length, accompanied by increased periods of silence, which indicates speech in bursts. (Although rapid speech rates have also been associated with competence and sociability (Miller, Maruyama, Beaber, & Valone, 1976), the sensation of hesitance was probably caused by the increase in silences.) A combination of an increase in the vowel F0 range and a decrease in intensity std was selected in two positive reactions (practical, answered properly).
According to Hirschberg (1993), an increase in pitch within certain words is used in an effort to structure the discourse. A decrease in the vowel F0 range was spotted in a negative reaction (turned face aside). The less prevalent voice-quality parameters nevertheless played an important role in discriminating between excited reactions (breathed rapidly, tremulous, answered properly) and opaque reactions (turned face aside, serene). Excitation was accompanied by an increase in shimmer and jitter, as also observed by Chung (2000). Decreasing trends in these parameters indicated opaque reactions and agree with similar measurements found for polite speaking modes (Grawunder & Winter, 2010).

One of the strengths of the present study was the homogeneity of the research population. All the participants belonged to the same organization and shared a similar social background. We also enriched the data analysis by employing distinct speech topics and using both external and self-assessed subjective interviewee evaluations. In addition, we used the same recording device and setup for all the recordings. Finally, only two research assistants conducted the recordings and the subjective evaluations. All these factors contributed to the consistency of the results by reducing the amount of noise in the measured parameters, at the level of both content and processing.

A few weaknesses should also be noted. These include some ambient noise and background speech in the recordings, and the limited scope of the research population, which included only women and was relatively small. We also note the limitations regarding the different artificially induced pressure scenarios during the interviews. These shortcomings notwithstanding, the present research contributes an additional step forward in the understanding of body language, by examining its relationship with visual and auditory variables in interview situations. A direct application of the reported results could be the development of protocols for analyzing or profiling interviewee behavior during audio or video chats, in particular with regard to performance appraisal interviews (Asmuß, 2008, 2013). This kind of interview discusses the performance of an employee vis-à-vis his or her employer and involves scenarios relatively similar to our experimental setup. In further studies, we plan to examine additional, more varied research populations, different scenarios, and additional nonverbal communication variables.

References

Ambady, N., & Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111(2), 256-274.

Asmuß, B. (2008). Performance appraisals: Preference organization in assessment sequences. Journal of Business Communication, 45(4), 408-429.

Asmuß, B. (2013). The emergence of symmetries and asymmetries in performance appraisal interviews: An interactional perspective. Economic and Industrial Democracy, 34(3), 553-570.

Baskett, G., & Freedle, R. (1974). Aspects of language pragmatics and the social perception of lying. Journal of Psycholinguistic Research, 3, 112-131.

Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341-345.

Busso, C., & Narayanan, S. S. (2007). Interrelation between speech and facial gestures in emotional utterances: A single subject study. IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2331-2347.
Chung, S. J. (2000). Expression and perception of emotion extracted from the spontaneous speech in Korean and in English (Unpublished doctoral dissertation). Sorbonne Nouvelle University, Paris.

Condon, W. (1976). An analysis of behavioral organization. Sign Language Studies, 13(1), 285-318.

Coulson, M. (2004). Attributing emotion to static body postures: Recognition accuracy, confusions, and viewpoint dependence. Journal of Nonverbal Behavior, 28(2), 117-139.

Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., & Taylor, J. G. (2001). Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1), 32-80.

Draper, N., & Smith, H. (1998). Applied regression analysis. New York, NY: Wiley.

Drioli, C., Tisato, G., Cosi, P., & Tesser, F. (2003). Emotions and voice quality: Experiments with sinusoidal modeling. In C. d'Alessandro & K. R. Scherer (Eds.), Voice quality: Functions, analysis and synthesis (VOQUAL'03). Geneva, Switzerland: ISCA Tutorial and Research Workshop.

Driver, J., & Spence, C. (2000). Multisensory perception: Beyond modularity and convergence. Current Biology, 10, R731-R735.

Fant, G. (1960). Acoustic theory of speech production. The Hague, Netherlands: Mouton.

Gatica-Perez, D., Vinciarelli, A., & Odobez, J. M. (2014). Nonverbal behavior analysis. In Multimodal interactive systems management (pp. 165-187). EPFL Press.

Goulden, C. (1956). Methods of statistical analysis (2nd ed.). New York, NY: Wiley.

Grawunder, S., & Winter, B. (2010). Acoustic correlates of politeness: Prosodic and voice quality measures in polite and informal speech of Korean and German speakers. In Proceedings of the Fifth International Conference on Speech Prosody. Chicago, IL.

Gunes, H., Piccardi, M., & Pantic, M. (2008). From the lab to the real world: Affect recognition using multiple cues and modalities. In Affective computing: Focus on emotion expression, synthesis, and recognition (pp. 185-218). I-Tech Education and Publishing.

Healey, M., & Braun, A. (2013). Shared neural correlates for speech and gesture. In F. Signorelli (Ed.), Functional brain mapping and the endeavor to understand the working brain. InTech. doi:10.5772/56493. Available from: http://www.intechopen.com/books/functional-brain-mapping-and-the-endeavor-to-understand-the-working-brain/shared-neural-correlates-for-speech-and-gesture

Hirschberg, J. (1993). Pitch accent in context: Predicting intonational prominence from text. Artificial Intelligence, 63(1-2), 305-340.

Juslin, P., & Laukka, P. (2001). Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion. Emotion, 1(4), 381-412.

Juslin, P., & Scherer, K. (2005). Vocal expression of affect. In J. Harrigan, R. Rosenthal, & K. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 65-135). Oxford, UK: Oxford University Press.

Kendon, A. (1980). Gesture and speech: Two aspects of the process of utterance. In M. R. Key (Ed.), Nonverbal communication and language (pp. 207-227). The Hague: Mouton.

Krauss, R. (1998). Why do we gesture when we speak? Current Directions in Psychological Science, 7, 54-59.

McNeill, D. (1985). So you think gestures are nonverbal? Psychological Review, 92(3), 350-371.
McNeill, D. (1996). Hand and mind: What gestures reveal about thought. Chicago, IL: University of Chicago Press.

Miller, N., Maruyama, G., Beaber, R. J., & Valone, K. (1976). Speed of speech and persuasion. Journal of Personality and Social Psychology, 34, 615-624.

Narayanan, S., & Georgiou, P. G. (2013). Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5), 1203-1233.

Nunes, A., Coimbra, L., & Teixeira, A. (2010). Voice quality of European Portuguese emotional speech. Computational Processing of the Portuguese Language, Lecture Notes in Computer Science, 6001, 142-151.

Pantic, M., & Vinciarelli, A. (2014). Social signal processing. In R. Calvo, S. D'Mello, J. Gratch, & A. Kappas (Eds.), The Oxford handbook of affective computing. doi:10.1093/oxfordhb/9780199942237.001.0001

Patel, S., Scherer, K., Sundberg, J., & Bjorkner, E. (2011). Mapping emotions into acoustic space: The role of voice production. Biological Psychology, 87, 93-98.

Rabiner, L., & Schafer, R. (1978). Digital processing of speech signals. Englewood Cliffs, NJ: Prentice-Hall.

Scherer, K. (1979). Personality markers in speech. In K. R. Scherer & H. Giles (Eds.), Social markers in speech (pp. 147-209). Cambridge, UK: Cambridge University Press.

Scherer, K. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1-2), 227-256.

Schwarz, P., Matejka, P., & Cernocky, J. (2006). Hierarchical structures of neural networks for phoneme recognition. In Proceedings of ICASSP 2006 (pp. 325-328).

Sobin, C., & Alpert, M. (1999). Emotion in speech: The acoustic attributes of fear, anger, sadness, and joy. Journal of Psycholinguistic Research, 28(4), 347-365.

Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., & Pantic, M. (2013, October). AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge (pp. 3-10). New York, NY: ACM.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80-83.

Yang, Z., & Narayanan, S. (2014). Analysis of emotional effect on speech-body gesture interplay. In Proceedings of Interspeech (pp. 1934-1938).

Yildirim, S., Bulut, M., Lee, C., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., & Narayanan, S. (2004). An acoustic study of emotions expressed in speech. In 8th International Conference on Spoken Language Processing (ICSLP '04) (pp. 2193-2196). Jeju Island, Korea.