Deep Affect Prediction in-the-wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond
Dimitrios Kollias · Panagiotis Tzirakis · Mihalis A. Nicolaou · Athanasios Papaioannou · Guoying Zhao · Björn Schuller · Irene Kotsia · Stefanos Zafeiriou

Accepted: 29 January 2019

Abstract Automatic understanding of human affect using visual signals is of great importance in everyday human-machine interactions. Appraising human emotional states, behaviors and reactions displayed in real-world settings can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the activation of the emotion) constitute popular and effective representations for affect. Nevertheless, the majority of collected datasets thus far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge) that was recently organized in conjunction with CVPR 2017 on the Aff-Wild database, and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network (CNN-RNN) layers, exploiting the invariant properties of convolutional features, while also modeling temporal dynamics that arise in human behavior via the recurrent layers. The AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the Aff-Wild database for learning features, which can be used as priors for achieving the best performance for both dimensional and categorical emotion recognition, using the RECOLA, AFEW-VA and EmotiW 2017 datasets, compared to all other methods designed for the same goal. The database and emotion recognition models are available at http://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge.

Keywords deep · convolutional · recurrent · Aff-Wild · database · challenge · in-the-wild · facial · dimensional · categorical · emotion · recognition · valence · arousal · AffWildNet · RECOLA · AFEW · AFEW-VA · EmotiW

1 Introduction

Current research in automatic analysis of facial affect aims at developing systems, such as robots and virtual humans, that will interact with humans in a naturalistic way under real-world settings.
To this end, such systems should automatically sense and interpret facial signals relevant to emotions, appraisals and intentions. Moreover, since real-world settings entail uncontrolled conditions, where subjects operate in a diversity of contexts and environments, systems that perform automatic analysis of human behavior should be robust to video recording conditions, the diversity of contexts and the timing of display.¹

¹ It is well known that the interpretation of a facial expression may depend on its dynamics, e.g. posed vs. spontaneous expressions [66].

Fig. 1: The 2-D Emotion Wheel

For the past twenty years, research in automatic analysis of facial behavior was mainly limited to posed behavior captured in highly controlled recording conditions [35, 41, 55, 57]. Some representative datasets, which are still used in many recent works [27], are the Cohn-Kanade database [35, 55], the MMI database [41, 57], the Multi-PIE database [22] and the BU-3D and BU-4D databases [62, 63].

Nevertheless, it is now accepted by the community that the facial expressions of naturalistic behaviors can be radically different from posed ones [10, 48, 66]. Hence, efforts have been made to collect subjects displaying naturalistic behavior. Examples include the recently collected EmoPain [4] and UNBC-McMaster [36] databases for the analysis of pain, the RU-FACS database of subjects participating in a false opinion scenario [5] and the SEMAINE corpus [39], which contains recordings of subjects interacting with a Sensitive Artificial Listener (SAL) in controlled conditions. All the above databases have been captured in well-controlled recording conditions and mainly under a strictly defined scenario (e.g., one eliciting pain).

Representing human emotions has been a basic topic of research in psychology. The most frequently used emotion representation is the categorical one, including the seven basic categories, i.e., Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral [14] [11]. It is, however, the dimensional emotion representation [61], [47] which is more appropriate for representing subtle emotions, i.e., not only extreme ones, appearing in everyday human-computer interactions. To this end, the 2-D valence and arousal space is the most common dimensional emotion representation. Figure 1 shows the 2-D Emotion Wheel [43], with valence ranging from very positive to very negative and arousal ranging from very active to very passive.

Some emotion recognition databases in the literature utilize a dimensional emotion representation. Examples are the SAL [21], SEMAINE [39], MAHNOB-HCI [53], Belfast naturalistic², Belfast induced [52], DEAP [29], RECOLA [46], SEWA³ and AFEW-VA [31] databases.

² https://belfast-naturalistic-db.sspnet.eu/
³ http://sewaproject.eu

Currently, there are many challenges (competitions) in the behavior analysis domain. One such example is the Audio/Visual Emotion Challenge (AVEC) series [44, 45, 56, 58, 59], which started in 2011. The first challenge [49] (2011) used the SEMAINE database for classification purposes by binarizing its continuous values, while the second challenge [50] (2012) used the same database with its original values. The last challenge (2017) [45] utilized the SEWA database. Before this, and for two consecutive years (2015 [44], 2016 [56]), the RECOLA dataset was used.
However, these databases have some of the limitations listed below, as shown in Table 1:

(1) they contain data recorded in laboratory or controlled environments;
(2) their diversity is limited, due to the small total number of subjects they contain, the limited amount of head pose variation, the presence of occlusions, the static background, or the uniform illumination;
(3) the total duration of their included videos is rather short.

Table 1: Databases annotated for both valence and arousal & their attributes.

Database | no of subjects | no of videos | duration of each video | condition
MAHNOB-HCI [53] | 27 | 20 | 34.9-117 secs | controlled
DEAP [29] | 32 | 40 | 1 min | controlled
AFEW-VA [31] | < 600 | 600 | 0.5-4 secs | in-the-wild
SAL [21] | 4 | 24 | 25 mins | controlled
SEMAINE [39] | 150 | 959 | 5 mins | controlled
Belfast naturalistic² | 125 | 298 | 10-60 secs | controlled
Belfast induced [52] | 37 | 37 | 5-30 secs | controlled
RECOLA [46] | 46 | 46 | 5 mins | controlled
SEWA³ | < 398 | 538 | 10-30 secs | in-the-wild

To tackle the aforementioned limitations, we collected the first, to the best of our knowledge, large-scale in-the-wild database and annotated it in terms of valence and arousal. To do so, we capitalized on the abundance of data available on video-sharing websites such as YouTube [64]⁴ and selected videos that display the affective behavior of people, for example videos displaying the behavior of people when watching a trailer, a movie or a disturbing clip, or their reactions to pranks.

⁴ The collection has been conducted under the scrutiny and approval of the Imperial College Ethical Committee (ICREC). The majority of the chosen videos were under a Creative Commons License (CCL). For those videos that were not under a CCL, we have contacted the person who created them and asked for their approval to be used in this research.

To this end, we have collected 298 videos displaying the reactions of 200 subjects, with a total video duration of more than 30 hours. This database has been annotated by 8 lay experts with regard to two continuous emotion dimensions, i.e. valence and arousal. We then organized the Aff-Wild Challenge based on the Aff-Wild database [65] [30], in conjunction with the International Conference on Computer Vision & Pattern Recognition (CVPR) 2017. The participating teams submitted their results to the challenge, outperforming the provided baseline. However, as described later in this paper, the achieved performances were rather low.

For this reason, we capitalized on the Aff-Wild database to build CNN and CNN plus RNN architectures shown to achieve excellent performance on this database, outperforming all previous participants' performances. We have carried out extensive experimentation, testing structures for combining convolutional and recurrent neural networks and training them together as an end-to-end architecture. We have used a loss function that is based on the Concordance Correlation Coefficient (CCC), which we also compare with the usual Mean Squared Error (MSE) criterion. Additionally, we appropriately fused, within the network structures, two types of inputs: the 2-D facial images, presented at the input of the end-to-end architecture, and the 2-D facial landmark positions, presented at the first fully connected layer of the architecture.
We have also investigated the use of the created CNN-RNN architecture for valence and arousal estimation on other datasets, focusing on RECOLA and AFEW-VA. Last but not least, taking into consideration the large in-the-wild nature of this database, we show that our network can also be used for other emotion recognition tasks, such as classification of the universal expressions.

The only challenge, apart from the last AVEC (2017) [45], using in-the-wild data is the EmotiW series [16-20]. It uses the AFEW dataset, whose samples come from movies, TV shows and series. To the best of our knowledge, this is the first time that a dimensional database, and features extracted from it, are used as priors for categorical emotion recognition in-the-wild, exploiting the EmotiW Challenge dataset.

To summarize, there exist several databases for dimensional emotion recognition. However, they have limitations, mostly due to the fact that they are not captured in-the-wild (i.e., not in uncontrolled conditions). This urged us to create the benchmark Aff-Wild database and organize the Aff-Wild Challenge. The results acquired are presented later in full detail. We proceeded to conduct experiments and build CNN and CNN plus RNN architectures, including the AffWildNet, producing state-of-the-art results.

The main contributions of the paper are the following:

• It is the first time that a large in-the-wild database - with a big variety of: (1) emotional states, (2) rapid emotional changes, (3) ethnicities, (4) head poses, (5) illumination conditions and (6) occlusions - has been generated and used for emotion recognition.
• An appropriate state-of-the-art deep neural network (DNN), the AffWildNet, has been developed, which is capable of learning to model all these phenomena. This has not been technically straightforward, as can be verified by comparing the AffWildNet's performance to the performances of the other DNNs developed by the research groups which participated in the Aff-Wild Challenge.
• It is shown that the AffWildNet is capable of generalizing its knowledge to other emotion recognition datasets and contexts. By learning complex and emotionally rich features of Aff-Wild, the AffWildNet constitutes a robust prior for both dimensional and categorical emotion recognition. To the best of our knowledge, it is the first time that state-of-the-art performances are achieved in this way.

Table 2: Current databases used for emotion recognition in this paper, their attributes and limitations compared to Aff-Wild.

Database | model of affect | condition | total no of frames | no of videos | no of annotators | limitations/comments
RECOLA | valence-arousal (continuous) | controlled | 345,000 | 46 | 6 | laboratory environment; moderate total amount of frames; small number of subjects (46)
AFEW | seven basic facial expressions | in-the-wild | 113,355 | 1809 | 3 | only 7 basic expressions; small total amount of frames; small number of annotators; imbalanced expression categories
AFEW-VA | valence-arousal (discrete) | in-the-wild | 30,050 | 600 | 2 | very small total amount of frames; discrete valence and arousal values; small number of annotators
Aff-Wild | valence-arousal (continuous) | in-the-wild | 1,224,100 | 298 | 8 | -

The rest of the paper is organized as follows. Section 2 presents the databases generated and used in the presented experiments. Section 3 describes the pre-processing and annotation methodologies that we used.
Section 4 begins by describing the Aff-Wild Challenge that was organized, the baseline method, and the methodologies of the participating teams and their results. It then presents the end-to-end DNNs which we developed and the best performing AffWildNet architecture. Finally, experimental studies and results are presented and discussed, illustrating the above developments. Section 5 describes how the AffWildNet can be used as a prior for other emotion recognition problems, both dimensional and categorical, yielding state-of-the-art results. Finally, Section 6 presents the conclusions and future work following the reported developments.

2 Existing Databases

We briefly present the RECOLA, AFEW and AFEW-VA databases used for emotion recognition and mention their limitations, which led to the creation of the Aff-Wild database. Table 2 summarizes these limitations, also showing the superior properties of Aff-Wild.

2.1 RECOLA Dataset

The REmote COLlaborative and Affective (RECOLA) database was introduced by Ringeval et al. [46] and contains natural and spontaneous emotions in the continuous domain (arousal and valence). The corpus includes four modalities: audio, visual, electro-dermal activity and electro-cardiogram. It consists of 46 French-speaking subjects, recorded for 9.5 h in total. The recordings were annotated for 5 minutes each by 6 French-speaking annotators (three male, three female). The dataset is divided into three parts, namely training (16 subjects), validation (15 subjects) and test (15 subjects), in such a way that gender, age and mother tongue are stratified (i.e., balanced).

The main limitations of this dataset include the tightly controlled laboratory environment, as well as the small number of subjects. It should also be noted that it contains a moderate total number of frames.

2.2 The AFEW Dataset

The series of EmotiW challenges [16-20] makes use of the data from the Acted Facial Expression In The Wild (AFEW) dataset [16]. This dataset is a dynamic temporal facial expressions data corpus consisting of close-to-real-world scenes extracted from movies and reality TV shows. In total it contains 1809 videos. The whole dataset is split into three sets: a training set (773 video clips), a validation set (383 video clips) and a test set (653 video clips). It should be emphasized that both the training and validation sets are mainly composed of real movie records; however, 114 out of the 653 video clips in the test set are real TV clips, thus increasing the difficulty of the challenge. The number of subjects is more than 330, aged 1-77 years. The annotation is according to 7 facial expressions (Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise) and was performed by three annotators. The EmotiW challenges focus on audiovisual classification of each clip into the seven basic emotion categories.

The limitations of the AFEW dataset include its small size (in terms of total number of frames) and its restriction to only seven emotion categories, some of which (fear, disgust, surprise) include a small number of samples.

2.3 The AFEW-VA Database

Very recently, a part of the AFEW dataset of the series of EmotiW challenges was annotated in terms of valence and arousal, thus creating the so-called AFEW-VA [31] database. In total, it contains 600 video clips that were extracted
from feature films and simulate real-world conditions, i.e., occlusions, different illumination conditions and free movements of subjects. The videos range from short (around 10 frames) to longer clips (more than 120 frames). This database includes per-frame annotations of valence and arousal. In total, more than 30,000 frames were annotated for dimensional affect prediction of arousal and valence, using discrete values in the range of [−10, +10].

The database's limitations include its small size (in terms of total number of frames), the small number of annotators (only 2) and the use of discrete values for valence and arousal. It should be noted that the 2-D Emotion Wheel (Figure 1) is a continuous space. Therefore, using only discrete values for valence and arousal provides a rather coarse approximation of the behavior of persons in their everyday interactions. On the other hand, using continuous values can provide improved modeling of the expressiveness and richness of the emotional states met in everyday human behaviors.

2.4 The Aff-Wild Database

We created a database consisting of 298 videos, with a total length of more than 30 hours. The aim was to collect spontaneous facial behaviors in arbitrary recording conditions. To this end, the videos were collected using the YouTube video-sharing website. The main keyword used to retrieve the videos was "reaction". The database displays subjects reacting to a variety of stimuli, e.g. viewing an unexpected plot twist of a movie or series, a trailer of a highly anticipated movie, or tasting something hot or disgusting. The subjects display both positive and negative emotions (or combinations of them). In other cases, subjects display emotions while performing an activity (e.g., riding a roller coaster). In some videos, subjects react to a practical joke, or to positive surprises (e.g., a gift). The videos contain subjects of different genders and ethnicities, with high variations in head pose and lighting.

Fig. 2: Frames from the Aff-Wild database which show subjects in different emotional states, of different ethnicities, in a variety of head poses, illumination conditions and occlusions.

Most of the videos are in YUV 4:2:0 format, with some of them being in AVI format. Eight subjects annotated the videos in terms of valence and arousal, following a methodology similar to the one proposed in [12]. An online annotation procedure was used, according to which the annotators watched each video and provided their annotations through a joystick. Valence and arousal range continuously in [−1, +1]. All subjects present in each video have been annotated. The total number of subjects is 200, with 130 of them male and 70 female. Table 3 shows the general attributes of the Aff-Wild database. Figure 2 shows some frames from the Aff-Wild database, with people of different ethnicities displaying various emotions, with different head poses and illumination conditions, as well as occlusions in the facial area.

Table 3: Attributes of the Aff-Wild Database

Attribute | Description
Length of videos | 0.10 - 14.47 min
Video format | AVI, MP4
Average Image Resolution (AIR) | 607 × 359
Standard deviation of AIR | 85 × 11
Median Image Resolution | 640 × 360

Figure 3 shows an example of annotated valence and arousal values over a part of a video in Aff-Wild, together with corresponding frames.
This illustrates the in-the-wild nature of our database, namely, that it includes many different emotional states, rapid emotional changes and occlusions in the facial areas. Figure 3 also shows the use of continuous values for the valence and arousal annotation, which gives the ability to effectively model all these different phenomena. Figure 4 provides a histogram of the annotated values for valence and arousal in the generated database.

Fig. 3: Valence and arousal annotations over a part of a video, along with corresponding frames; illustrating (i) the in-the-wild nature of Aff-Wild (different emotional states, rapid emotional changes, occlusions) and (ii) the use of continuous values for valence and arousal.

Fig. 4: Histogram of valence and arousal annotations of the Aff-Wild database.

3 Data Pre-processing and Annotation

In this section we describe the pre-processing of the Aff-Wild videos, performed so as to carry out face and facial landmark detection. Then we present the annotation procedure, including:

(1) creation of the annotation tool;
(2) generation of guidelines for six experts to follow in order to perform the annotation;
(3) annotation post-processing: the six annotators watched all videos again, checked their annotations and performed any corrections; two new annotators watched all videos and selected the 2-4 annotations that best described each video; the final annotations are the mean of the annotations selected by these two new annotators.

The detected faces and facial landmarks, as well as the generated annotations, are publicly available with the Aff-Wild database. Finally, we present a statistical analysis of the annotations created for each video, illustrating the consistency of the annotations achieved by using the above procedure.

3.1 Aff-Wild video pre-processing

VirtualDub [33] was used first to trim the raw YouTube videos, mainly at their beginning and end-points, in order to remove irrelevant content (e.g., advertisements). Then, we extracted a total of 1,224,100 video frames using the Menpo software [2]. In each frame, we detected the faces and generated corresponding bounding boxes, using the method described in [38]. Next, we extracted facial landmarks in all frames, using the best performing method as indicated in [8]. During this process, we removed frames in which the bounding box or landmark detection failed. Failures occurred when either the bounding boxes or the landmarks were wrongly detected, or were not detected at all. The former case was semi-automatically discovered by: (i) detecting significant shifts in the bounding box and landmark positions between consecutive frames and (ii) having the annotators verify the wrong detections in the corresponding frames.
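The shift-based screening in step (i) can be implemented as a simple per-frame displacement test. Below is a minimal sketch of this idea in Python/NumPy; the threshold value and the exact distance measure are illustrative assumptions, not the settings used in our pipeline.

```python
import numpy as np

def flag_detection_failures(landmarks, threshold=0.25):
    """Flag frames whose landmarks jump abnormally far from the previous frame.

    landmarks: array of shape (num_frames, 68, 2) with per-frame 2-D landmark
               positions (NaN rows mark frames where detection returned nothing).
    threshold: maximum tolerated mean landmark displacement between consecutive
               frames, as a fraction of the face size (illustrative value).
    Returns a boolean array of shape (num_frames,) marking suspect frames.
    """
    flags = np.zeros(len(landmarks), dtype=bool)
    # Frames with no detection at all are failures by definition.
    flags |= np.isnan(landmarks).any(axis=(1, 2))
    for t in range(1, len(landmarks)):
        if flags[t] or flags[t - 1]:
            continue
        # Normalize the displacement by the face size (bounding-box diagonal).
        face_size = np.linalg.norm(landmarks[t - 1].max(0) - landmarks[t - 1].min(0))
        mean_shift = np.linalg.norm(landmarks[t] - landmarks[t - 1], axis=1).mean()
        if mean_shift > threshold * face_size:
            flags[t] = True  # large jump: pass this frame to annotators for verification
    return flags
```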
3.2 Annotation tool

For data annotation, we developed our own application, which builds on existing ones such as Feeltrace [12] and Gtrace [13]. A time-continuous annotation is performed for each affective dimension, with the annotation process being as follows:

(a) the user logs in to the application using an identifier (e.g. his/her name) and selects an appropriate joystick;
(b) a scrolling list of all videos appears and the user selects a video to annotate;
(c) a screen appears that shows the selected video and a slider of valence or arousal values ranging in [−1, 1];
(d) the user annotates the video by moving the joystick either up or down;
(e) finally, a file is created including the annotation values and the corresponding time instances at which the annotations were generated.

It should be mentioned that the time instances generated in step (e) above did not generally match the video frame rate. To tackle this problem, we re-sampled the annotation time instances using nearest-neighbor interpolation.
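As an illustration of this re-sampling step, the sketch below maps joystick annotations, recorded at irregular time instances, onto the uniform frame grid of a video by nearest-neighbor look-up. The function name and the frame rate are hypothetical; only the interpolation scheme follows the text.

```python
import numpy as np

def resample_to_frames(ann_times, ann_values, num_frames, fps=30.0):
    """Assign each video frame the annotation recorded nearest in time.

    ann_times:  1-D sorted array of annotation timestamps in seconds (irregular).
    ann_values: 1-D array of valence or arousal values in [-1, 1].
    num_frames: number of frames in the video.
    fps:        video frame rate (hypothetical value here).
    """
    frame_times = np.arange(num_frames) / fps
    # For each frame time, find the insertion index among annotation timestamps.
    idx = np.searchsorted(ann_times, frame_times)
    idx = np.clip(idx, 1, len(ann_times) - 1)
    left, right = ann_times[idx - 1], ann_times[idx]
    # Choose whichever neighbor is closer in time (nearest-neighbor rule).
    idx -= (frame_times - left) < (right - frame_times)
    return ann_values[idx]
```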
Figure 5 shows the graphical interface of our tool when annotating valence (the interface for arousal is similar); this corresponds to step (c) of the annotation process described above.

Fig. 5: The GUI of the annotation tool when annotating valence (the GUI for arousal is exactly the same).

It should also be added that the annotation tool has the ability to display the inserted valence and arousal annotations while playing back the respective video. This is used for annotation verification in a post-processing step.

3.3 Annotation guidelines

Six experts were chosen to perform the annotation task.⁵ Each annotator was instructed, orally and through a multi-page document, on the procedure to follow for the task. This document included a list of well-identified emotional cues for both arousal and valence, providing a common basis for the annotation task. On top of that, the experts used their own appraisal of the subject's emotional state when creating the annotations. Before starting the annotation of each video, the experts watched the whole video so as to know what to expect regarding the emotions displayed in it.

⁵ All annotators were computer scientists who were working on face analysis problems and all had a working understanding of facial expressions.

3.4 Annotation Post-processing

A post-processing annotation verification step was also performed. Every expert-annotator watched all videos for a second time, in order to verify that the recorded annotations were in accordance with the emotions shown in the videos, or to change the annotations accordingly. In this way, a further validation of the annotations was achieved.

After the annotations had been validated by the annotators, a final annotation selection step followed. Two new experts watched all videos and, for every video, selected the annotations (between two and four) which best described the displayed emotions. The mean of these selected annotations constitutes the final Aff-Wild labels. This step is significant for obtaining highly correlated annotations, as shown by the statistical analysis presented next.

3.5 Statistical Analysis of Annotations

In the following, we provide a rich quantitative statistical analysis of the achieved Aff-Wild labeling. At first, for each video, and independently for valence and arousal, we computed:

(i) the inter-annotator correlations, i.e., the correlations of each one of the six annotators with all other annotators, which resulted in five correlation values per annotator;
(ii) for each annotator, his/her average inter-annotator correlation, resulting in one value per annotator; the mean of those six average inter-annotator correlations is denoted next as MAC-A;
(iii) the average inter-annotator correlations across only the selected annotators, as described in the previous subsection, resulting in one value per selected annotator; the mean of those 2-4 average inter-selected-annotator correlations is denoted next as MAC-S.

We then computed, over all videos and independently for valence and arousal, the mean of the MAC-A and the mean of the MAC-S values computed in (ii) and (iii) above. The mean MAC-A is 0.47 for valence and 0.46 for arousal, whilst the mean MAC-S is 0.71 for valence and 0.70 for arousal. An example set of annotations is shown in Figure 6, in an effort to further clarify the obtained MAC-S values. It shows the four selected annotations in a video segment for valence and arousal, respectively, with a MAC-S value of 0.70 (similar to the mean MAC-S value obtained over all of Aff-Wild).

Fig. 6: The four selected annotations in a video segment for (a) valence and (b) arousal. In both cases, the value of MAC-S (mean of average correlations between these four annotations) is 0.70. This value is similar to the mean MAC-S obtained over all of Aff-Wild.

In addition, Figure 7 shows the cumulative distribution of the MAC-S and MAC-A values over all Aff-Wild videos for valence (Figure 7a) and arousal (Figure 7b). In each case, two curves are shown. Every point (x, y) on these curves has a y value showing the percentage of videos with a (i) MAC-S (red curve) or (ii) MAC-A (blue curve) value greater than or equal to x; the latter denotes an average correlation in [0, 1]. It can be observed that the mean MAC-S value, corresponding to a value of 0.5 on the vertical axis, is 0.71 for valence and 0.70 for arousal. These plots also illustrate that the MAC-S values are much higher than the corresponding MAC-A values for both the valence and arousal annotations, verifying the effectiveness of the annotation post-processing procedure.

Fig. 7: The cumulative distribution of MAC-S (mean of average inter-selected-annotator correlations) and MAC-A (mean of average inter-annotator correlations) values over all Aff-Wild videos for valence (a) and arousal (b). The Figure shows the percentage of videos with a MAC-S/MAC-A value greater than or equal to the values shown on the horizontal axis. The mean MAC-S value, corresponding to a value of 0.5 on the vertical axis, is 0.71 for valence and 0.70 for arousal.
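To make the MAC-A/MAC-S definitions concrete, the following sketch computes both quantities for one video from a matrix of per-frame annotations. The use of Pearson correlation between annotator pairs is our assumption, as the text does not name the correlation type.

```python
import numpy as np

def mean_average_correlation(annotations):
    """MAC for one video: mean over annotators of each annotator's
    average correlation with all other annotators.

    annotations: array of shape (num_annotators, num_frames), one row per
                 annotator, for a single video and a single dimension
                 (valence or arousal).
    """
    n = len(annotations)
    # Pairwise correlation matrix between annotators (Pearson, assumed).
    corr = np.corrcoef(annotations)
    # Average each annotator's correlation with the other n-1 annotators
    # (subtracting the self-correlation of 1 on the diagonal).
    avg_per_annotator = (corr.sum(axis=1) - 1.0) / (n - 1)
    return avg_per_annotator.mean()

# MAC-A uses all six annotators; MAC-S restricts the matrix to the
# 2-4 annotations chosen in the selection step, e.g.:
# mac_a = mean_average_correlation(all_six)
# mac_s = mean_average_correlation(all_six[selected_indices])
```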
Next, we conducted similar experiments on the valence/arousal average annotations and the facial landmarks in each video, in order to evaluate the correlation of the annotations to the landmarks. To this end, we utilized Canonical Correlation Analysis (CCA) [23]. In particular, for each video and independently for valence and arousal, we computed the correlation between the landmarks and the average of (i) all or (ii) the selected annotations.

Figure 8 shows the cumulative distribution of these correlations over all Aff-Wild videos for valence (Figure 8a) and arousal (Figure 8b), similarly to Figure 7. The results of this analysis verify that the annotator-landmark correlation is much higher in the case of the selected annotations than in the case of all annotations.

Fig. 8: The cumulative distribution of the correlation between landmarks and the average of (i) all or (ii) selected annotations over all Aff-Wild videos for valence (a) and arousal (b). The Figure shows the percentage of videos with a correlation value greater than or equal to the values shown on the horizontal axis.
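A per-video version of this analysis can be sketched with scikit-learn's CCA, as below. Treating the per-frame landmark coordinates as one view and the averaged annotation as the other is our reading of the procedure, and the single-component setup is an assumption.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def landmark_annotation_correlation(landmarks, mean_annotation):
    """Canonical correlation between landmarks and an averaged annotation.

    landmarks:       array of shape (num_frames, 68 * 2), flattened per-frame
                     landmark coordinates of one video.
    mean_annotation: array of shape (num_frames,), the average of all (or of
                     the selected) annotations for valence or arousal.
    """
    cca = CCA(n_components=1)
    u, v = cca.fit_transform(landmarks, mean_annotation.reshape(-1, 1))
    # Correlation of the first (and only) pair of canonical variates.
    return np.corrcoef(u[:, 0], v[:, 0])[0, 1]
```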
4 Developing the AffWildNet

This section begins by presenting the first Aff-Wild Challenge, which was organized based on the Aff-Wild database and held in conjunction with CVPR 2017. It includes short descriptions and results of the algorithms of the six research groups that participated in the challenge. Although the results are promising, there is much room for improvement. For this reason, we developed our own CNN and CNN plus RNN architectures based on the Aff-Wild database. We propose the AffWildNet as the best performing among the developed architectures. Our developments, ablation studies and discussions are presented next.

4.1 The Aff-Wild Challenge

The training data (i.e., videos and annotations) of the Aff-Wild challenge were made publicly available on the 30th of January 2017, followed by the release of the test videos (without annotations). The participants were given the freedom to split the data into training and validation sets, as well as to use any other dataset. The maximum number of submitted entries per participant was three. Table 4 summarizes the specific attributes (numbers of males, females, videos, frames) of the training and test sets of the challenge.

Table 4: Attributes of the Training and Test sets of Aff-Wild.

Set | no of males | no of females | no of videos | total no of frames
Training | 106 | 48 | 252 | 1,008,650
Test | 24 | 22 | 46 | 215,450

In total, ten different research groups downloaded the Aff-Wild database. Six of them ran experiments and submitted their results to the workshop portal. Based on the performance they obtained on the test data, three of them were selected to present their results at the workshop.

Two criteria were considered for evaluating the performance of the networks. The first is the Concordance Correlation Coefficient (CCC) [32], which is widely used in measuring the performance of dimensional emotion recognition methods, e.g., in the series of AVEC challenges. CCC evaluates the agreement between two time series (e.g., all video annotations and predictions) by scaling their correlation coefficient with their mean square difference. In this way, predictions that are well correlated with the annotations but shifted in value are penalized in proportion to the deviation. CCC takes values in the range [−1, 1], where +1 indicates perfect concordance and −1 denotes perfect discordance. The higher the value of the CCC, the better the fit between annotations and predictions; therefore high values are desired. The mean value of the CCC for valence and arousal estimation was adopted as the main evaluation criterion. CCC is defined as follows:

\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2} = \frac{2 s_x s_y \rho_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2},   (1)

where \rho_{xy} is the Pearson Correlation Coefficient (Pearson CC), s_x^2 and s_y^2 are the variances of all video valence/arousal annotations and predicted values, respectively, and s_{xy} is the corresponding covariance.

The second criterion is the Mean Squared Error (MSE), which is defined as follows:

MSE = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2,   (2)

where x and y are the (valence/arousal) annotations and predictions, respectively, and N is the total number of samples. The MSE gives a rough indication of how the derived emotion model is behaving, providing a simple comparative metric. A small value of MSE is desired.
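The two evaluation criteria of Eqs. (1) and (2) can be computed directly from prediction and annotation arrays; a minimal NumPy sketch follows (this is not the challenge's official scoring code).

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient between annotations x and predictions y (Eq. 1)."""
    x_mean, y_mean = x.mean(), y.mean()
    s_xy = ((x - x_mean) * (y - y_mean)).mean()   # covariance
    s_x2, s_y2 = x.var(), y.var()                 # variances
    return 2.0 * s_xy / (s_x2 + s_y2 + (x_mean - y_mean) ** 2)

def mse(x, y):
    """Mean Squared Error between annotations x and predictions y (Eq. 2)."""
    return np.mean((x - y) ** 2)

# The challenge criterion is the mean CCC over the two dimensions:
# score = (ccc(val_ann, val_pred) + ccc(ar_ann, ar_pred)) / 2
```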
4.1.1 Baseline Architecture

The baseline architecture for the challenge was based on the CNN-M [7] network, as a simple model that could be used to initiate the procedure. In particular, our network used the convolutional and pooling parts of CNN-M, pre-trained on the FaceValue dataset [3]. On top of that, we added one 4096-unit fully connected layer and a 2-unit fully connected layer that provides the valence and arousal predictions. The interested reader can refer to Appendix A for a short description and the structure of this architecture.

The input to the network consisted of the facial images, resized to a resolution of 224 × 224 × 3 or 96 × 96 × 3, with the intensity values normalized to the range [−1, 1]. In order to train the network, we utilized the Adam optimizer; the batch size was set to 80 and the initial learning rate to 0.001. Training was performed on a single GeForce GTX TITAN X GPU and took about 4-5 days. The platform used for this implementation was TensorFlow [1].

4.1.2 Participating Teams' Algorithms

The three papers accepted to this challenge are briefly reported below, while Table 5 compares the results acquired (in terms of CCC and MSE) by all three methods and the baseline network. As one can see, FATAUVA-Net [6] provided the best results in terms of the mean CCC and mean MSE for valence and arousal. We should note that, after the end of the challenge, more groups enquired about the Aff-Wild database and sent results for evaluation, but here we report only on the teams that participated in the challenge.

Table 5: Concordance Correlation Coefficient (CCC) and Mean Squared Error (MSE) of the valence & arousal predictions provided by the methods of the three participating teams and the baseline architecture. A higher CCC and a lower MSE value indicate a better performance.

CCC | Valence | Arousal | Mean Value
MM-Net | 0.196 | 0.214 | 0.205
FATAUVA-Net | 0.396 | 0.282 | 0.339
DRC-Net | 0.042 | 0.291 | 0.167
Baseline | 0.150 | 0.100 | 0.125

MSE | Valence | Arousal | Mean Value
MM-Net | 0.134 | 0.088 | 0.111
FATAUVA-Net | 0.123 | 0.095 | 0.109
DRC-Net | 0.161 | 0.094 | 0.128
Baseline | 0.130 | 0.140 | 0.135

In the MM-Net method [34], a variation of a deep convolutional residual neural network (ResNet) [24] is first presented for affective level estimation of facial expressions. Then, multiple memory networks are used to model temporal relations between the video frames. Finally, ensemble models are used to combine the predictions of the multiple memory networks, showing that the latter steps improve the initially obtained performance, as far as MSE is concerned, by more than 10%.

In the FATAUVA-Net method [6], a deep learning framework is presented in which a core layer, an attribute layer, an action unit (AU) layer and a valence-arousal layer are trained sequentially. The core layer is a series of convolutional layers, followed by the attribute layer, which extracts facial features. These layers are applied to supervise the learning of AUs. Finally, the AUs are employed as mid-level representations to estimate the intensity of valence and arousal.

In the DRC-Net method [37], three neural-network-based methods built on Inception-ResNet [54] modules, redesigned specifically for the task of facial affect estimation, are presented and compared. These methods are: Shallow Inception-ResNet, Deep Inception-ResNet, and Inception-ResNet with Long Short-Term Memory [25]. Facial features are extracted at different scales and both valence and arousal are simultaneously estimated in each frame. The best results are obtained by the Deep Inception-ResNet method.

All participants applied deep learning methods to the problem of emotion analysis of the video inputs. The following conclusions can be drawn from the reported results. First, the CCC of the arousal predictions was really low for all three methods. Second, the MSE of the valence predictions was high for all three methods and the CCC was low, except for the winning method. This illustrates the difficulty of recognizing emotion in-the-wild, where, for instance, illumination conditions differ, occlusions are present and different head poses are met.

4.2 Deep Neural Architectures & Ablation Studies

Here, we present our developments and ablation studies towards designing deep CNN and CNN plus RNN architectures for the Aff-Wild. We present the proposed architecture, AffWildNet, which is a CNN plus RNN network that produced the best results on the database.

4.2.1 The Roadmap

A. We considered two network settings:
   (1) a CNN network trained in an end-to-end manner, i.e., using raw intensity pixels, to produce 2-D predictions of valence and arousal;
   (2) an RNN stacked on top of the CNN to capture temporal information in the data before predicting the affect dimensions; this was also trained in an end-to-end manner.
   To extract features from the frames, we experimented with three CNN architectures, namely ResNet-50, VGG-Face [42] and VGG-16 [51]. To consider the contextual information in the data (RNN case), we experimented with both the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) [9] architectures.

B. To further boost the performance of the networks, we also experimented with the use of facial landmarks. Here we should note that the facial landmarks are provided on-the-fly for training and testing the networks. The following two scenarios were tested:
   (1) the networks were applied directly on the cropped facial video frames of the generated database;
   (2) the networks were trained on both the facial video frames and the facial landmarks corresponding to the same frames.

C. Since the main evaluation criterion of the Aff-Wild Challenge was the mean value of the CCC for valence and arousal, our loss function was based on that criterion (a code sketch of this loss follows the list) and was defined as:

   L_{total} = 1 - \frac{\rho_a + \rho_v}{2},   (3)

   where \rho_a and \rho_v are the CCC for arousal and valence, respectively.

D. In order to have a more balanced dataset for training, we performed data augmentation, mainly through over-sampling, by duplicating [40] some data from the Aff-Wild database. We copied small video parts showing less-populated valence and arousal values. In particular, we duplicated consecutive video frames that had negative valence and arousal values, as well as positive valence and negative arousal values. As a consequence, the training set consisted of about 43% positive valence and arousal values, 24% negative valence and positive arousal values, 19% positive valence and negative arousal values and 14% negative valence and arousal values. Our main target has been a trade-off between generating balanced emotion sets and avoiding severely changing the content of the videos.
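A minimal TensorFlow sketch of the CCC-based loss of Eq. (3) is given below; the paper states the loss and the platform, but the implementation details (e.g., computing the CCC per batch) are our assumptions.

```python
import tensorflow as tf

def ccc_loss(labels, predictions):
    """Loss of Eq. (3): 1 - mean CCC over the valence and arousal dimensions.

    labels, predictions: float tensors whose last axis has size 2, where
    column 0 is valence and column 1 is arousal. The CCC is computed per
    column over the batch (an assumption; one could also compute it over
    whole sequences).
    """
    labels = tf.reshape(labels, [-1, 2])
    predictions = tf.reshape(predictions, [-1, 2])
    mean_l = tf.reduce_mean(labels, axis=0)
    mean_p = tf.reduce_mean(predictions, axis=0)
    var_l = tf.reduce_mean(tf.square(labels - mean_l), axis=0)
    var_p = tf.reduce_mean(tf.square(predictions - mean_p), axis=0)
    cov = tf.reduce_mean((labels - mean_l) * (predictions - mean_p), axis=0)
    ccc = 2.0 * cov / (var_l + var_p + tf.square(mean_l - mean_p))
    return 1.0 - tf.reduce_mean(ccc)  # averages the valence and arousal CCCs
```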
4.2.2 Developing CNN architectures for the Aff-Wild

For the CNN architectures, we considered the ResNet-50 and VGG-16 networks, pre-trained on the ImageNet [15] dataset, which has been broadly used for state-of-the-art object detection. We also considered the VGG-Face network, pre-trained for face recognition on the VGG-Face dataset [42]. The VGG-Face proved to provide the best results, as reported next in the experimental section. It is worth mentioning that in our experiments we trained those architectures both for predicting valence and arousal jointly at their output and for predicting valence and arousal separately. The obtained results were similar in the two cases. In all experiments presented next, we focus on the simultaneous prediction of valence and arousal.

The first architecture we utilized was the deep residual network (ResNet) of 50 layers [24], on top of which we stacked a 2-layer fully connected (FC) network. For the first FC layer, the best results were obtained when using 1500 units; for the second FC layer, 256 units provided the best results. An output layer with two linear units followed, providing the valence and arousal predictions. The interested reader can refer to Appendix A for a short description and the structure of this architecture.

The other architecture that we utilized was based on the convolutional and pooling layers of the VGG-Face or VGG-16 networks, on top of which we stacked a 2-layer FC network. For the first and second FC layers, the best results were obtained when using 4096 units. An output layer followed, including two linear units, providing the valence and arousal predictions. The interested reader can refer to Appendix A for a short description and the structure of this architecture as well.

In the case when landmarks were used (scenario B.2 in subsection 4.2.1), these were input to the first FC layer along with: i) the outputs of the ResNet-50, or ii) the outputs of the last pooling layer of the VGG-Face/VGG-16. In this way, both the outputs and the landmarks were mapped to the same feature space before performing the prediction.

With respect to parameter selection in those CNN architectures, we used a batch size in the range 10-100 and a constant learning rate in the range 0.00001-0.001. The best results were obtained with a batch size equal to 50 and a learning rate equal to 0.0001. The dropout probability was set to 0.5.

4.2.3 Developing CNN plus RNN architectures for the Aff-Wild

In order to consider the contextual information in the data, we developed a CNN-RNN architecture, in which the RNN part was fed with the outputs of either the first or the second fully connected layer of the respective CNN networks.

The structure of the RNN which we examined consisted of one or two hidden layers, with 100-150 units, following either the LSTM neuron model with peephole connections or the GRU neuron model. Using one fully connected layer in the CNN part and two hidden layers with GRUs in the RNN part was found to provide the best results. An output layer followed, including two linear units, providing the valence and arousal predictions.

Table 6 shows the configuration of the CNN-RNN architecture. The CNN part of this architecture was based on the convolutional and pooling layers of the CNN architectures described above (VGG-Face or ResNet-50), followed by a fully connected layer. Note that in the case of scenario B.2 of subsection 4.2.1, both the outputs of the last pooling layer of the CNN and the 68 landmark 2-D positions (68 × 2 values) were provided as inputs to this fully connected layer. Table 6 shows the respective numbers of units for the GRU and the fully connected layers. We call this CNN plus RNN architecture AffWildNet and illustrate it in Figure 9.

Fig. 9: The AffWildNet: it consists of the convolutional and pooling layers of either the VGG-Face or ResNet-50 structures (denoted as CNN), followed by a fully connected layer (denoted as FC1) and two RNN layers with GRU units (V and A stand for valence and arousal, respectively).

Table 6: The AffWildNet architecture: the fully connected 1 layer has 4096 or 1500 hidden units, depending on whether VGG-Face or ResNet-50 is used.

block 1 | VGG-Face or ResNet-50 conv & pooling parts
block 2 | fully connected 1 (4096 or 1500 units), dropout
block 3 | GRU layer 1 (128 units), dropout
block 4 | GRU layer 2 (128 units)
block 5 | fully connected 2 (2 units)
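The configuration of Table 6 can be summarized in a tf.keras sketch, given below for the VGG-Face variant with landmark inputs. The cnn_trunk stands in for the VGG-Face convolutional/pooling layers (whose pre-trained weights are not bundled here), and the 96 × 96 × 3 input resolution and sequence length are placeholders rather than confirmed settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN = 80  # sequence length that gave the best results in the paper

def build_affwildnet(cnn_trunk, num_landmarks=68):
    """Sketch of the AffWildNet of Table 6 (VGG-Face variant).

    cnn_trunk: a Keras model mapping one frame to a flat feature vector,
               standing in for the VGG-Face convolutional/pooling layers.
    """
    frames = layers.Input((SEQ_LEN, 96, 96, 3), name="frames")
    landmarks = layers.Input((SEQ_LEN, num_landmarks * 2), name="landmarks")

    # block 1: CNN features per frame
    feats = layers.TimeDistributed(cnn_trunk)(frames)
    # block 2: FC1 on the concatenated CNN features and landmark positions
    x = layers.Concatenate()([feats, landmarks])
    x = layers.TimeDistributed(layers.Dense(4096, activation="relu"))(x)
    x = layers.Dropout(0.5)(x)
    # blocks 3-4: two GRU layers of 128 units
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.Dropout(0.5)(x)
    x = layers.GRU(128, return_sequences=True)(x)
    # block 5: linear valence/arousal outputs per frame
    out = layers.TimeDistributed(layers.Dense(2, activation="linear"))(x)
    return Model([frames, landmarks], out)
```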
Network evaluation was performed by testing different parameter values. The parameters included: the batch size and sequence length used for network parameter updating, the learning rate and the dropout probability. The final selection of these parameters was similar to the CNN cases, apart from the sequence length, which was selected in the range 50-200, and the batch size, which was selected in the range 2-10. The best results were obtained with a sequence length of 80 and a batch size of 4. We note that all deep learning architectures were implemented on the TensorFlow platform.

4.3 Experimental Results

In the following, we present the affect recognition results obtained when applying the above-derived CNN-only and CNN plus RNN architectures to the Aff-Wild database.

At first, we trained the VGG-Face network using two different annotations. One, which is provided in the Aff-Wild database, is the average of the selected annotations (as described in subsection 3.4). The second is that of a single annotator (the one with the highest correlation to the landmarks). It should be mentioned that the latter is generally less smooth than the former, average one, and hence more difficult to model. We then tested the two trained networks in the two scenarios described in subsection 4.2.1, case B, i.e., using or not using the 68 2-D landmark inputs. The results are summarized in Table 7. As expected, better results were obtained when the mean of the annotations was used. Moreover, Table 7 shows that there is a notable improvement in performance when the 68 2-D landmark positions are also used as input data.

Table 7: CCC and MSE based evaluation of the valence & arousal predictions provided by the VGG-Face (using the mean of the annotators' values, or only one annotator's values; when landmarks were or were not given as input to the network).

CCC | With Landmarks (Valence / Arousal) | Without Landmarks (Valence / Arousal)
One Annotator | 0.39 / 0.27 | 0.35 / 0.25
Mean of Annotators | 0.51 / 0.33 | 0.44 / 0.32

MSE | With Landmarks (Valence / Arousal) | Without Landmarks (Valence / Arousal)
One Annotator | 0.15 / 0.13 | 0.16 / 0.14
Mean of Annotators | 0.10 / 0.08 | 0.12 / 0.11

Next, we examined the use of various numbers of hidden layers and hidden units per layer when training and testing the VGG-Face-GRU network. Some characteristic selections and their corresponding performances are shown in Table 8. It can be seen that the best results were obtained when the RNN part of the network consisted of 2 layers, each with 128 hidden units.

Table 8: Obtained CCC values for valence & arousal estimation when changing the number of hidden units & hidden layers in the VGG-Face-GRU architecture. A higher CCC value indicates a better performance.

Hidden Units | 1 Hidden Layer (Valence / Arousal) | 2 Hidden Layers (Valence / Arousal)
100 | 0.44 / 0.36 | 0.50 / 0.41
128 | 0.53 / 0.40 | 0.57 / 0.43
150 | 0.46 / 0.39 | 0.51 / 0.41

Table 9 summarizes the CCC and MSE values obtained when applying all developed architectures described in subsections 4.2.2 and 4.2.3 to the Aff-Wild test set. It shows the improvement in the CCC and MSE values obtained when using the AffWildNet compared to all other developed architectures. This improvement clearly indicates the ability of the AffWildNet to better capture the dynamics in Aff-Wild.

In Figures 10(a) and 10(b), we qualitatively illustrate some of the obtained results by comparing a segment of the obtained valence/arousal predictions to the ground truth values, over 10,000 consecutive frames of test data. Moreover, in Figures 11(a) and 11(b), we illustrate, in the 2-D valence & arousal space, the histograms of the ground truth labels of the test set and the corresponding predictions of our AffWildNet.

The results shown in Table 9 and the above Figures verify the excellent performance of the AffWildNet. They also show that it greatly outperformed all methods submitted to the Aff-Wild Challenge.

4.4 Discussing AffWildNet's Performance

The reasons why the AffWildNet outperformed the other methods are related to both the network design and the network training.

At first, the AffWildNet is a CNN-RNN network. The CNN part is based on the convolutional and pooling layers of the VGG-Face (or ResNet-50) network. The VGG-Face network has been pre-trained with a large dataset for face recognition (many human faces have, therefore, been used in its construction).
In our implementation, this CNN part is followed by a single FC layer. The inputs to this layer are: a) the outputs of the last pooling layer of the CNN part; and b) the facial landmarks, which are passed directly as inputs to this FC layer. As a consequence, this layer has the role of mapping its two types of inputs to the same feature space before forwarding them to the RNN part. The facial landmarks, provided as additional input to the network in this way, contribute to boosting the performance of our model. The output of the fully connected layer is then passed to the RNN part.

Table 9: CCC and MSE based evaluation of the valence & arousal predictions provided by: 1) the CNN architecture when using three different pre-trained networks for initialization (VGG-16, ResNet-50, VGG-Face) and 2) the VGG-Face-LSTM and AffWildNet architectures (2 RNN layers with 128 units each). A higher CCC and a lower MSE value indicate a better performance.

CCC | Valence | Arousal | Mean Value
VGG-16 | 0.40 | 0.30 | 0.35
ResNet-50 | 0.43 | 0.30 | 0.37
VGG-Face | 0.51 | 0.33 | 0.42
VGG-Face-LSTM | 0.52 | 0.38 | 0.45
AffWildNet | 0.57 | 0.43 | 0.50

MSE | Valence | Arousal | Mean Value
VGG-16 | 0.13 | 0.11 | 0.12
ResNet-50 | 0.11 | 0.11 | 0.11
VGG-Face | 0.10 | 0.08 | 0.09
VGG-Face-LSTM | 0.10 | 0.09 | 0.10
AffWildNet | 0.08 | 0.06 | 0.07

Fig. 10: Predictions vs labels for (a) valence and (b) arousal over a video segment of the Aff-Wild.

The RNN is used in order to model the contextual information in the data, taking into account temporal variations. The RNN is composed of 2 layers, with GRU units in each layer; the first layer processes the FC layer outputs, and the second layer is followed by the output layer that gives the final estimates for valence and arousal.

Part of AffWildNet's design was the fixing of its optimal hyper-parameters (number of FC and RNN layers, number of hidden units in these layers, batch size, sequence length, dropout, learning rate). Finally, the specification of the loss function used for network training was another important issue. Our loss function was based on the CCC, as this was the main evaluation criterion of the Aff-Wild Challenge; this was not the case in the competing methods, which used the usual MSE criterion in their training phases.

Fig. 11: Histogram in the 2-D valence & arousal space of: (a) annotations and (b) predictions of the AffWildNet, on the test set of the Aff-Wild Challenge.

As far as network training is concerned, the AffWildNet has been trained as an end-to-end architecture, by jointly training its CNN and RNN parts, rather than training the two parts separately. We would also like to mention that the data augmentation conducted to achieve a more balanced dataset also contributed to the AffWildNet achieving state-of-the-art performance.

5 Feature Learning from Aff-Wild

When it comes to dimensional emotion recognition, there exists great variability between different databases, especially those containing emotions in-the-wild.
In particular, the annotators and the range of the annotations differ, and the labels can be either discrete or continuous. To tackle the problems caused by this variability, we take advantage of the fact that Aff-Wild is a powerful database that can be exploited for learning features, which may then be used as priors for dimensional emotion recognition. In the following, we show that it can be used as a prior for the RECOLA and AFEW-VA databases, which are annotated for valence and arousal just like Aff-Wild. In addition to this, we use it as a prior for categorical emotion recognition on the EmotiW dataset, which is annotated in terms of the seven basic emotions. Experiments have been conducted on these databases, yielding state-of-the-art results and thus verifying the strength of Aff-Wild for affect recognition.

5.1 Prior for Valence and Arousal Prediction

5.1.1 Experimental Results on the Aff-Wild and RECOLA databases

In this subsection, we demonstrate the superiority of our database when it is used for pre-training a DNN. In particular, we fine-tune the AffWildNet on RECOLA and, for comparison purposes, we also train on RECOLA an architecture comprised of a ResNet-50 with a 2-layer GRU stacked on top (let us call it the ResNet-GRU network). Table 10 shows the results only for the CCC score, as our minimization loss depends on this metric. It is clear that the performance for both arousal and valence of the model fine-tuned on the Aff-Wild database is much higher than the performance of the ResNet-GRU model.

Table 10: CCC based evaluation of the valence & arousal predictions provided by the fine-tuned AffWildNet and the ResNet-GRU on the RECOLA test set. A higher CCC value indicates a better performance.

CCC | Valence | Arousal
Fine-tuned AffWildNet | 0.526 | 0.273
ResNet-GRU | 0.462 | 0.209

To further demonstrate the benefits of our model when predicting valence and arousal, we show histograms in the 2-D valence & arousal space of the annotations (Figure 12(a)) and of the predictions of the fine-tuned AffWildNet (Figure 12(b)) for the whole test set of RECOLA.

Fig. 12: Histogram in the 2-D valence & arousal space of (a) annotations and (b) predictions for the test set of the RECOLA database.

Finally, we also illustrate in Figures 13(a) and 13(b) the network predictions and ground truth for one test video of RECOLA, for the valence and arousal dimensions, respectively.

Fig. 13: Fine-tuned AffWildNet's predictions vs labels for (a) valence and (b) arousal for a single test video of the RECOLA database.
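A minimal sketch of this transfer recipe (restore the AffWildNet weights and continue training on the target database with the CCC loss) is given below; the checkpoint path, learning rate, epoch count and data pipeline are placeholders, not values from the paper.

```python
import tensorflow as tf

# Hypothetical helpers: build_affwildnet and ccc_loss are the sketches given
# earlier; recola_sequences() would yield ((frames, landmarks), labels) batches.
model = build_affwildnet(cnn_trunk)                      # same topology as before
model.load_weights("affwildnet_affwild.ckpt")            # placeholder checkpoint path
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small LR for fine-tuning (assumed)
              loss=ccc_loss)
model.fit(recola_sequences(), epochs=5)                  # epochs value is illustrative
```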
Moreover, differences were observed due to the fact that the labels of AFEW-VA are discrete, while the labels of Aff-Wild are continuous. Figure 14 shows the discrete valence and arousal values of the annotations in the AFEW-VA database, whereas Figure 15 shows the corresponding histogram in the 2-D valence & arousal space.

Fig. 14: Discrete values of annotations of the AFEW-VA database.

Fig. 15: Histogram in the 2-D valence & arousal space of annotations of the AFEW-VA database.

We then fine-tuned the AffWildNet on the AFEW-VA database and tested the performance of the generated network. Similarly to [31], we used a 5-fold person-independent cross-validation strategy. Table 11 shows a comparison of the performance of the fine-tuned AffWildNet with the best results reported in [31], in terms of the Pearson CC. It can be easily seen that the fine-tuned AffWildNet greatly outperformed the best method reported in [31].

Table 11: Pearson Correlation Coefficient (Pearson CC) based evaluation of valence & arousal predictions provided by the best architecture in [31] vs our AffWildNet fine-tuned on AFEW-VA. A higher Pearson CC value indicates better performance.

Group                   Valence  Arousal
best of [31]            0.407    0.45
Fine-tuned AffWildNet   0.514    0.575

For comparison purposes, we also trained a CNN network on the AFEW-VA database. This network's architecture was based on the convolution and pooling layers of VGG-Face, followed by 2 fully connected layers with 4096 and 2048 hidden units, respectively. As shown in Table 13, the fine-tuned AffWildNet, in terms of CCC, greatly outperformed this network as well. All these results verify that our network can be used as a pre-trained one to yield excellent results across different dimensional databases.

Table 13: CCC based evaluation of valence & arousal predictions provided by the CNN architecture based on VGG-Face and the fine-tuned AffWildNet on the AFEW-VA training set. A higher CCC value indicates better performance.

CCC                     Valence  Arousal
only CNN                0.44     0.474
Fine-tuned AffWildNet   0.515    0.556

Table 12: Accuracies on the EmotiW validation set obtained by different CNN and CNN-RNN architectures vs the fine-tuned AffWildNet. A higher accuracy value indicates better performance.

Architecture            Neutral  Anger  Disgust  Fear   Happy  Sad    Surprise  Total
VGG-16                  0.327    0.424  0.102    0.093  0.476  0.138  0.133     0.263
VGG-16 + RNN            0.431    0.559  0.026    0.07   0.444  0.259  0.044     0.293
ResNet                  0.31     0.153  0.077    0.023  0.534  0.207  0.067     0.211
ResNet + RNN            0.431    0.237  0.077    0.07   0.587  0.155  0.089     0.261
VGG-Face + RNN          0.552    0.593  0.026    0.047  0.794  0.259  0.111     0.384
fine-tuned AffWildNet   0.569    0.627  0.051    0.023  0.746  0.709  0.111     0.454

5.2 Prior for Categorical Emotion Recognition

5.2.1 Experimental Results for the EmotiW dataset

To further show the strength of the AffWildNet, we used it, although trained for the dimensional emotion recognition task, in a very different problem: categorical in-the-wild emotion recognition, focusing on the EmotiW 2017 Grand Challenge. To tackle categorical emotion recognition, we modified the AffWildNet's output layer to include 7 neurons (one for each basic emotion category) and performed fine-tuning on the AFEW 5.0 dataset.
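The paper does not provide code for this output-layer replacement; the sketch below illustrates the idea in tf.keras, assuming a hypothetical affwildnet Keras model whose final layer is the 2-unit valence/arousal head:

import tensorflow as tf

NUM_EMOTIONS = 7  # Neutral, Anger, Disgust, Fear, Happy, Sad, Surprise

def to_categorical_head(affwildnet):
    # Take the activations feeding the old 2-unit regression head and
    # attach a 7-way softmax classifier in its place; the resulting model
    # is then fine-tuned on AFEW 5.0.
    features = affwildnet.layers[-2].output
    outputs = tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax",
                                    name="emotion_head")(features)
    return tf.keras.Model(inputs=affwildnet.input, outputs=outputs)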
In the presented experiments, we compare the fine-tuned AffWildNet's performance with that of other state-of-the-art CNN and CNN-RNN networks, whose CNN parts are based on the ResNet-50, VGG-16 and VGG-Face architectures and which were trained on the same AFEW 5.0 dataset. The accuracies of all networks on the validation set of the EmotiW 2017 Grand Challenge are shown in Table 12; a higher accuracy value indicates better performance. We can easily see that the AffWildNet outperforms all those other networks in terms of total accuracy. We should note that: (i) the AffWildNet was trained to classify only video frames (and not audio), and video classification was then performed based on frame aggregation; (ii) only the cropped faces provided by the challenge were used (and not our own detection and/or normalization procedure); (iii) no data augmentation, post-processing of the results or ensemble methodology was used.

It should also be mentioned that the fine-tuned AffWildNet's performance, in terms of total accuracy, is: (i) much higher than the baseline total accuracy of 0.3881 reported in [16]; (ii) better than all vanilla architectures' performances reported by the three winning methods in the audio-video emotion recognition EmotiW 2017 Grand Challenge [26] [28] [60]; (iii) comparable to, and in some cases better than, the rest of the results obtained by the three winning methods [26] [28] [60].

The above are shown in Table 14. Those results verify that the AffWildNet can be appropriately fine-tuned and successfully used for dimensional, as well as for categorical, emotion recognition.
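As noted in point (i) above, video-level labels were obtained by aggregating frame-level predictions; since the aggregation rule is not specified in the paper, the score averaging below is an assumption, shown as a minimal sketch:

import numpy as np

EMOTIONS = ["Neutral", "Anger", "Disgust", "Fear", "Happy", "Sad", "Surprise"]

def classify_video(frame_scores):
    # frame_scores: array of shape (num_frames, 7) holding the network's
    # per-frame outputs (e.g. softmax probabilities) for the 7 categories.
    video_scores = frame_scores.mean(axis=0)  # aggregate over all frames
    return EMOTIONS[int(np.argmax(video_scores))]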
Table 14: Overall accuracies of the best architectures of the three winning methods of the EmotiW 2017 Grand Challenge, reported on the validation set, vs our fine-tuned AffWildNet. A higher accuracy value indicates better performance.

Group  Architecture               Total Accuracy
                                  Original  After Fine-Tuning on FER2013  Data augmentation
[26]   DenseNet-121               0.414     -                             -
       HoloNet                    0.41      -                             -
       ResNet-50                  0.418     -                             -
[28]   VGG-Face                   0.379     0.483                         -
       FR-Net-A                   0.337     0.446                         -
       FR-Net-B                   0.334     0.488                         -
       FR-Net-C                   0.376     0.452                         -
       LSTM + FR-Net-B            -         0.465                         0.504
[60]   Weighted C3D (no overlap)  0.421     -                             -
       LSTM C3D (no overlap)      0.432     -                             -
       VGG-Face                   0.414     -                             -
       VGG-LSTM 1 layer           0.486     -                             -
Our    Fine-tuned AffWildNet      0.454     -                             -

6 Conclusions and Future Work

Deep learning and deep neural networks have been successfully used in the past years for facial expression and emotion recognition based on still image and video frame analysis. Recent research focuses on in-the-wild facial analysis and refers either to categorical emotion recognition, targeting recognition of the seven basic emotion categories, or to dimensional emotion recognition, analyzing the valence-arousal (V-A) representation space.

In this paper, we introduce Aff-Wild, a new, large, in-the-wild database that consists of 298 videos of 200 subjects, with a total length of more than 30 hours. We also present the Aff-Wild Challenge that was organized on Aff-Wild, and we report the results of the challenge, together with the pitfalls and challenges of predicting valence and arousal in-the-wild. Furthermore, we design a deep convolutional and recurrent neural architecture and perform extensive experimentation with the Aff-Wild database. We show that the generated AffWildNet provides the best performance for valence and arousal estimation on the Aff-Wild dataset, in terms of both the Concordance Correlation Coefficient and the Mean Squared Error criteria, when compared with other deep learning networks trained on the same database.

We then demonstrate that the AffWildNet and the Aff-Wild database constitute tools that can be used for facial expression and emotion recognition on other datasets. Using appropriate fine-tuning and retraining methodologies, we show that best results can be obtained by applying the AffWildNet to other dimensional databases, including RECOLA and AFEW-VA, and by comparing the obtained performances with those of other state-of-the-art pre-trained and fine-tuned networks.

Furthermore, we observe that fine-tuning the AffWildNet can produce state-of-the-art performance, not only for dimensional, but also for categorical emotion recognition. We use this approach to tackle the facial expression and emotion recognition parts of the EmotiW 2017 Grand Challenge, referring to recognition of the seven basic emotion categories, finding that we produce results comparable to or better than those of the winners of this contest.

It should be stressed that, to the best of our knowledge, this is the first time that the same deep architecture has been used for both dimensional and categorical emotion analysis. To achieve this, the AffWildNet has been effectively trained with the largest existing in-the-wild database for continuous valence-arousal recognition (a regression problem) and then used for tackling the discrete seven basic emotion recognition (classification) problem.

The proposed procedure for fine-tuning the AffWildNet can be applied to further extend its use in the analysis of other new visual emotion recognition datasets. This includes our current work on extending Aff-Wild with new in-the-wild audiovisual information, as well as using it as a means for unifying different approaches to facial expression and emotion recognition. These approaches contain dimensional emotion representations, basic and compound emotion categories, facial action unit representations, as well as specific emotion categories met in different contexts, such as negative emotions, emotions in games, in social groups and in other human-machine (or robot) interactions.

Acknowledgements The work of Stefanos Zafeiriou has been partially funded by the FiDiPro program of Tekes with project number 1849/31/2015. The work of Dimitris Kollias was funded by a Teaching Fellowship of Imperial College London. The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged. We also thank the NVIDIA Corporation for donating a Titan X GPU. We would also like to acknowledge the contribution of the YouTube users that gave us permission to use their videos (especially Zalzar and Eddie from The1stTake). We wish to thank Dr A. Dhall for providing us with the data of the EmotiW 2017 Grand Challenge. Additionally, we would like to thank the reviewers for their valuable comments that helped us to improve this paper.

References
1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: OSDI, vol. 16, pp. 265–283 (2016)
2. Alabort-i-Medina, J., Antonakos, E., Booth, J., Snape, P., Zafeiriou, S.: Menpo: A comprehensive platform for parametric image alignment and visual deformable models. In: Proceedings of the ACM International Conference on Multimedia, MM '14, pp. 679–682. ACM, New York, NY, USA (2014)
3. Albanie, S., Vedaldi, A.: Learning grimaces by watching TV. In: Proceedings of the British Machine Vision Conference (BMVC) (2016)
4. Aung, M.S., Kaltwang, S., Romera-Paredes, B., Martinez, B., Singh, A., Cella, M., Valstar, M.F., Meng, H., Kemp, A., Elkins, A.C., Tyler, N., Watson, P.J., Williams, A.C., Pantic, M., Berthouze, N.: The automatic detection of chronic pain-related expression: requirements, challenges and a multimodal dataset. IEEE Transactions on Affective Computing (2016)
5. Bartlett, M.S., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Fully automatic facial action recognition in spontaneous behavior. In: Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, pp. 223–230. IEEE (2006)
6. Chang, W.Y., Hsu, S.H., Chien, J.H.: FATAUVA-Net: An integrated deep learning framework for facial attribute recognition, action unit (AU) detection, and valence-arousal estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (2017)
7. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 (2014)
8. Chrysos, G.G., Antonakos, E., Snape, P., Asthana, A., Zafeiriou, S.: A comprehensive performance evaluation of deformable face tracking in-the-wild. International Journal of Computer Vision 126(2-4), 198–232 (2018)
9. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
10. Corneanu, C., Oliu, M., Cohn, J., Escalera, S.: Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
11. Cowie, R., Cornelius, R.R.: Describing the emotional states that are expressed in speech. Speech Communication 40(1), 5–32 (2003)
12. Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M., Schröder, M.: 'Feeltrace': An instrument for recording perceived emotion in real time. In: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion (2000)
13. Cowie, R., McKeown, G., Douglas-Cowie, E.: Tracing emotion: an overview. International Journal of Synthetic Emotions (IJSE) 3(1), 1–17 (2012)
14. Dalgleish, T., Power, M.: Handbook of Cognition and Emotion. John Wiley & Sons (2000)
15. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE (2009)
16. Dhall, A., Goecke, R., Ghosh, S., Joshi, J., Hoey, J., Gedeon, T.: From individual to group-level emotion recognition: EmotiW 5.0.
In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 524–528. ACM (2017)
17. Dhall, A., Goecke, R., Joshi, J., Hoey, J., Gedeon, T.: EmotiW 2016: Video and group-level emotion recognition challenges. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 427–432. ACM (2016)
18. Dhall, A., Goecke, R., Joshi, J., Sikka, K., Gedeon, T.: Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 461–466. ACM (2014)
19. Dhall, A., Goecke, R., Joshi, J., Wagner, M., Gedeon, T.: Emotion recognition in the wild challenge 2013. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, pp. 509–516. ACM (2013)
20. Dhall, A., Ramana Murthy, O., Goecke, R., Joshi, J., Gedeon, T.: Video and image based emotion recognition challenges in the wild: EmotiW 2015. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 423–426. ACM (2015)
21. Douglas-Cowie, E., Cowie, R., Cox, C., Amier, N., Heylen, D.K.: The sensitive artificial listener: an induction technique for generating emotionally coloured conversation. In: LREC Workshop on Corpora for Research on Emotion and Affect. ELRA (2008)
22. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-PIE. Image and Vision Computing 28(5), 807–813 (2010)
23. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: An overview with application to learning methods. Technical report, Royal Holloway, University of London (2003)
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
25. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
26. Hu, P., Cai, D., Wang, S., Yao, A., Chen, Y.: Learning supervised scoring ensemble for emotion recognition in the wild. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 553–560. ACM (2017)
27. Jung, H., Lee, S., Yim, J., Park, S., Kim, J.: Joint fine-tuning in deep neural networks for facial expression recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2983–2991 (2015)
28. Knyazev, B., Shvetsov, R., Efremova, N., Kuharenko, A.: Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video. arXiv preprint arXiv:1711.04598 (2017)
29. Koelstra, S., Muhl, C., Soleymani, M., Lee, J.S., Yazdani, A., Ebrahimi, T., Pun, T., Nijholt, A., Patras, I.: DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing 3(1), 18–31 (2012)
30. Kollias, D., Nicolaou, M., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (2017)
31. Kossaifi, J., Tzimiropoulos, G., Todorovic, S., Pantic, M.: AFEW-VA database for valence and arousal estimation in-the-wild. Image and Vision Computing (2017)
32. Lin, L.I.K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255–268 (1989)
33. Lee, A.: Welcome to virtualdub.org! virtualdub.org (2002)
34.
Li, J., Chen, Y., Xiao, S., Zhao, J., Roy, S., Feng, J., Yan, S., Sim, T.: Estimation of affective level in the wild with multiple memory networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (2017)
35. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 94–101. IEEE (2010)
36. Lucey, P., Cohn, J.F., Prkachin, K.M., Solomon, P.E., Matthews, I.: Painful data: The UNBC-McMaster shoulder pain expression archive database. In: Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pp. 57–64. IEEE (2011)
37. Mahoor, M., Hasani, B.: Facial affect estimation in the wild using deep residual and convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (2017)
38. Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L.: Face detection without bells and whistles. In: European Conference on Computer Vision, pp. 720–735. Springer (2014)
39. McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schröder, M.: The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3(1), 5–17 (2012)
40. More, A.: Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048 (2016)
41. Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pp. 5 pp. IEEE (2005)
42. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: BMVC, vol. 1, p. 6 (2015)
43. Plutchik, R.: Emotion: A Psychoevolutionary Synthesis. Harpercollins College Division (1980)
44. Ringeval, F., Schuller, B., Valstar, M., Cowie, R., Pantic, M.: AVEC 2015: The 5th international audio/visual emotion challenge and workshop. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1335–1336. ACM (2015)
45. Ringeval, F., Schuller, B., Valstar, M., Gratch, J., Cowie, R., Scherer, S., Mozgai, S., Cummins, N., Schmitt, M., Pantic, M.: AVEC 2017: Real-life depression, and affect recognition workshop and challenge (2017)
46. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pp. 1–8. IEEE (2013)
47. Russell, J.A.: Evidence of convergent validity on the dimensions of affect. Journal of Personality and Social Psychology 36(10), 1152 (1978)
48. Sariyanidi, E., Gunes, H., Cavallaro, A.: Automatic analysis of facial affect: A survey of registration, representation, and recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 37(6), 1113–1133 (2015)
49. Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: AVEC 2011: The first international audio/visual emotion challenge. In: Affective Computing and Intelligent Interaction, pp. 415–424. Springer (2011)
50.
Schuller, B., Valstar, M., Eyben, F., Cowie, R., Pantic, M.: AVEC 2012: The continuous audio/visual emotion challenge. In: Proceedings of the 14th ACM International Conference on Multimodal Interaction, pp. 449–456. ACM (2012)
51. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
52. Sneddon, I., McRorie, M., McKeown, G., Hanratty, J.: The Belfast induced natural emotion database. IEEE Transactions on Affective Computing 3(1), 32–41 (2012)
53. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing 3(1), 42–55 (2012)
54. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol. 4, p. 12 (2017)
55. Tian, Y.l., Kanade, T., Cohn, J.F.: Recognizing action units for facial expression analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on 23(2), 97–115 (2001)
56. Valstar, M., Gratch, J., Schuller, B., Ringeval, F., Lalanne, D., Torres Torres, M., Scherer, S., Stratou, G., Cowie, R., Pantic, M.: AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2016)
57. Valstar, M., Pantic, M.: Induced disgust, happiness and surprise: an addition to the MMI facial expression database. In: Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, p. 65 (2010)
58. Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., Cowie, R., Pantic, M.: AVEC 2014: 3D dimensional affect and depression recognition challenge. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2014)
59. Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., Schnieder, S., Cowie, R., Pantic, M.: AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2013)
60. Vielzeuf, V., Pateux, S., Jurie, F.: Temporal multimodal fusion for video emotion classification in the wild. arXiv preprint arXiv:1709.07200 (2017)
61. Whissel, C.: The dictionary of affect in language. In: Plutchik, R., Kellerman, H. (eds.) Emotion: Theory, Research and Experience: Vol. 4, The Measurement of Emotions. Academic, New York (1989)
62. Yin, L., Chen, X., Sun, Y., Worm, T., Reale, M.: A high-resolution 3D dynamic facial expression database. In: Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference On, pp. 1–6. IEEE (2008)
63. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.J.: A 3D facial expression database for facial behavior research. In: Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, pp. 211–216. IEEE (2006)
64. YouTube, L.: Youtube. Retrieved 27, 2011 (2011)
65. Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-Wild: Valence and arousal 'in-the-wild' challenge. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1980–1987. IEEE (2017)
66.
Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31(1), 39–58 (2009)

A Appendix

A.1 Baseline: CNN-M

The exact structure of the network is shown in Table 15. In total, it consists of 5 convolutional, batch normalization and pooling layers and 2 fully connected (FC) ones. For each convolutional layer the parameters are the filter and the stride, in the form of (filter height, filter width, input channels, output channels/feature maps) and (1, stride height, stride width, 1), respectively; for the max pooling layers the parameters are the ksize and stride, in the form of (pooling height, pooling width, input channels, output channels) and (1, stride height, stride width, 1), respectively. We follow TensorFlow's platform notation for the values of all those parameters. Note that the activation function in the convolutional and batch normalization layers is the ReLU; this is also the case in the first FC layer. The activation function of the second FC layer, which is the output layer, is linear.

Table 15: Baseline architecture based on CNN-M, showing the values of the parameters of the convolutional and pooling layers and the number of hidden units in the fully connected layers, in TensorFlow's platform notation.

Layer               filter/ksize      stride        padding  no. of units
conv 1              [7, 7, 3, 96]     [1, 2, 2, 1]  'VALID'
batch norm
max pooling         [1, 3, 3, 1]      [1, 2, 2, 1]  'VALID'
conv 2              [5, 5, 96, 256]   [1, 2, 2, 1]  'SAME'
batch norm
max pooling         [1, 3, 3, 1]      [1, 2, 2, 1]  'SAME'
conv 3              [3, 3, 256, 512]  [1, 1, 1, 1]  'SAME'
batch norm
conv 4              [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
batch norm
conv 5              [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
batch norm
max pooling         [1, 2, 2, 1]      [1, 2, 2, 1]  'SAME'
fully connected 1                                            4096
fully connected 2                                            2
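For concreteness, the layer sequence of Table 15 can be transcribed as the following tf.keras sketch; the input resolution and the Keras formulation are our assumptions (the paper uses raw TensorFlow notation), with ReLU placed after each batch normalization as described above:

import tensorflow as tf

def build_cnn_m(input_shape=(96, 96, 3)):  # input size assumed, not fixed by Table 15
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=input_shape))
    # (filters, kernel, stride, padding) for each conv row of Table 15
    conv_cfg = [(96, 7, 2, "valid"), (256, 5, 2, "same"),
                (512, 3, 1, "same"), (512, 3, 1, "same"), (512, 3, 1, "same")]
    # max pooling (ksize, stride, padding) after conv 1, conv 2 and conv 5
    pool_cfg = {0: (3, 2, "valid"), 1: (3, 2, "same"), 4: (2, 2, "same")}
    for i, (filters, kernel, stride, padding) in enumerate(conv_cfg):
        model.add(tf.keras.layers.Conv2D(filters, kernel, strides=stride,
                                         padding=padding))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.ReLU())
        if i in pool_cfg:
            ksize, pstride, ppad = pool_cfg[i]
            model.add(tf.keras.layers.MaxPool2D(ksize, strides=pstride,
                                                padding=ppad))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(4096, activation="relu"))
    model.add(tf.keras.layers.Dense(2))  # linear output: valence and arousal
    return model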
A.2 ResNet-50

Residual learning is adopted in these models by stacking multiple blocks of the form:

o_k = B(x_k, {W_k}) + h(x_k),    (4)

where x_k, W_k and o_k indicate the input, the weights, and the output of layer k, respectively, B indicates the residual function that is learnt, and h is the identity mapping between the residual function and the input. When the dimensions of x_k and B(x_k, {W_k}) differ, h becomes a projection of x_k that matches them (done by 1 × 1 convolutions), as in [24].

The first layer of the ResNet-50 model is comprised of a 7 × 7 convolutional layer with 64 feature maps, followed by a max pooling layer of size 3 × 3. Next, there are 4 groups of bottleneck blocks, where a shortcut connection is added after each block. Each of these blocks is comprised of 3 convolutional layers of sizes 1 × 1, 3 × 3, and 1 × 1, with different numbers of feature maps. The architecture of the network is depicted in Figure 16.

Fig. 16: The CNN-only architecture for valence and arousal estimation, based on the ResNet-50 structure and including two fully connected layers (V and A stand for valence and arousal respectively). Each convolutional layer is in the format: filter height × filter width, number of input feature maps, number of output feature maps.

A.3 VGG-Face/VGG-16

Table 16 shows the configuration of the CNN architecture based on VGG-Face or VGG-16. In total, it is composed of thirteen convolutional and pooling layers and three fully connected ones. For all those layers, the form of the parameters is the same as described above for the baseline architecture; we follow TensorFlow's platform notation for their values. The output number of units is also shown in the Table. A linear activation function was used in the last FC layer, providing the final estimates. All units in the remaining FC layers were equipped with the ReLU. Dropout has been added after the first FC layer in order to avoid over-fitting. The architecture of the network is depicted in Figure 17.

Fig. 17: The CNN-only architecture for valence and arousal estimation, based on the VGG-Face structure (V and A stand for valence and arousal respectively).

Table 16: CNN architecture based on VGG-Face/VGG-16, showing the values of the parameters of the convolutional and pooling layers and the number of hidden units in the fully connected layers, in TensorFlow's platform notation.

Layer               filter/ksize      stride        padding  no. of units
conv 1              [3, 3, 3, 64]     [1, 1, 1, 1]  'SAME'
conv 2              [3, 3, 64, 64]    [1, 1, 1, 1]  'SAME'
max pooling         [1, 2, 2, 1]      [1, 2, 2, 1]  'SAME'
conv 3              [3, 3, 64, 128]   [1, 1, 1, 1]  'SAME'
conv 4              [3, 3, 128, 128]  [1, 1, 1, 1]  'SAME'
max pooling         [1, 2, 2, 1]      [1, 2, 2, 1]  'SAME'
conv 5              [3, 3, 128, 256]  [1, 1, 1, 1]  'SAME'
conv 6              [3, 3, 256, 256]  [1, 1, 1, 1]  'SAME'
conv 7              [3, 3, 256, 256]  [1, 1, 1, 1]  'SAME'
max pooling         [1, 2, 2, 1]      [1, 2, 2, 1]  'SAME'
conv 8              [3, 3, 256, 512]  [1, 1, 1, 1]  'SAME'
conv 9              [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
conv 10             [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
max pooling         [1, 2, 2, 1]      [1, 2, 2, 1]  'SAME'
conv 11             [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
conv 12             [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
conv 13             [3, 3, 512, 512]  [1, 1, 1, 1]  'SAME'
max pooling         [1, 2, 2, 1]      [1, 2, 2, 1]  'SAME'
fully connected 1                                            4096
dropout
fully connected 2                                            4096
fully connected 3                                            2
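Similarly, a compact tf.keras transcription of Table 16 is sketched below, again with an assumed input resolution, since the table does not fix one:

import tensorflow as tf

# (number of 3x3 conv layers, feature maps) for the five blocks of Table 16
VGG_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def build_vgg_va(input_shape=(96, 96, 3), dropout_rate=0.5):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=input_shape))
    for num_convs, maps in VGG_BLOCKS:
        for _ in range(num_convs):
            model.add(tf.keras.layers.Conv2D(maps, 3, padding="same",
                                             activation="relu"))
        model.add(tf.keras.layers.MaxPool2D(2, strides=2, padding="same"))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(4096, activation="relu"))
    model.add(tf.keras.layers.Dropout(dropout_rate))  # after the first FC layer
    model.add(tf.keras.layers.Dense(4096, activation="relu"))
    model.add(tf.keras.layers.Dense(2))  # linear output: valence and arousal
    return model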