Surpassing Human-Level Face Verification Performance on LFW with GaussianFace
Authors: Chaochao Lu, Xiaoou Tang
Department of Information Engineering, The Chinese University of Hong Kong
{lc013, xtang}@ie.cuhk.edu.hk

Abstract

Face verification remains a challenging problem in very complex conditions with large variations such as pose, illumination, expression, and occlusions. This problem is exacerbated when we rely unrealistically on a single training data source, which is often insufficient to cover the intrinsically complex face variations. This paper proposes a principled multi-task learning approach based on the Discriminative Gaussian Process Latent Variable Model, named GaussianFace, to enrich the diversity of training data. In comparison to existing methods, our model exploits additional data from multiple source-domains to improve the generalization performance of face verification in an unknown target-domain. Importantly, our model can adapt automatically to complex data distributions, and can therefore well capture the complex face variations inherent in multiple sources. Extensive experiments demonstrate the effectiveness of the proposed model in learning from diverse data sources and generalizing to unseen domains. Specifically, our algorithm achieves an accuracy of 98.52% on the well-known and challenging Labeled Faces in the Wild (LFW) benchmark [23]. For the first time, the human-level performance in face verification (97.53%) [28] on LFW is surpassed.

1. Introduction

Face verification, the task of determining whether a pair of face images are from the same person, has been an active research topic in computer vision for decades [28, 22, 46, 5, 47, 31, 14, 9]. It has many important applications, including surveillance, access control, image retrieval, and automatic log-on for personal computers or mobile devices.
However, various visual complications deteriorate the performance of face verification, as shown by numerous studies on real-world face images from the wild [23]. The Labeled Faces in the Wild (LFW) dataset is well known as a challenging benchmark for face verification. (For project updates, please refer to mmlab.ie.cuhk.edu.hk.) The dataset provides a large set of relatively unconstrained face images with complex variations in pose, lighting, expression, race, ethnicity, age, gender, clothing, hairstyles, and other parameters. Not surprisingly, LFW has proven difficult for automatic face verification methods [23, 28]. Although there has been significant work [22, 9, 5, 14, 47, 13, 59, 50, 51, 53] on LFW, and the accuracy rate has improved from 60.02% [56] to 97.35% [53] since LFW was established in 2007, these studies have not closed the gap to human-level performance [28] in face verification.

Why have we not been able to surpass human-level performance? Two possible reasons are as follows:

1) Most existing face verification methods assume that the training data and the test data are drawn from the same feature space and follow the same distribution. When the distribution changes, these methods may suffer a large performance drop [58]. However, many practical scenarios involve cross-domain data drawn from different facial appearance distributions. Learning a model solely on a single source of data often leads to overfitting due to dataset bias [55]. Moreover, it is difficult to collect the sufficient and necessary training data to rebuild the model in new scenarios for highly accurate face verification specific to the target domain. In such cases, it becomes critical to exploit more data from multiple source-domains to improve the generalization of face verification methods in the target-domain.
2) Modern face verification methods are mainly divided into two categories: extracting low-level features [36, 3, 34, 10, 24], and building classification models [62, 50, 13, 37, 31, 56, 5, 28, 47, 33]. Although these existing methods have made great progress in face verification, most of them are less flexible when dealing with complex data distributions. For the methods in the first category, low-level features such as SIFT [36], LBP [3], and Gabor [34] are handcrafted. Even for features learned from data [10, 24], the algorithm parameters (such as the depth of the random projection tree, or the number of centers in k-means) also need to be specified by users. Similarly, for the methods in the second category, the architectures of the deep networks in [62, 50, 63, 51] (for example, the number of layers and the number of nodes in each layer), and the parameters of the models in [31, 5, 28, 47] (for example, the number of Gaussians or the number of classifiers) must also be determined in advance. Since most existing methods require some assumptions to be made about the structure of the data, they cannot work well when the assumptions are not valid. Moreover, due to the existence of these assumptions, it is hard to capture the intrinsic structure of the data using these methods.

To this end, we propose a Multi-Task Learning approach based on the Discriminative Gaussian Process Latent Variable Model (DGPLVM) [57], named GaussianFace, for face verification. Unlike most existing studies [22, 5, 14, 47, 13] that rely on a single training data source, in order to take advantage of more data from multiple source-domains to improve the performance in the target-domain, we introduce a multi-task learning constraint to the DGPLVM. Here, we investigate asymmetric multi-task learning because we only focus on the performance improvement of the target task.
From the perspective of information theory, this constraint aims to maximize the mutual information between the distributions of the target-domain data and the multiple source-domain data. Moreover, the GaussianFace model is a reformulation based on Gaussian Processes (GPs) [42], a non-parametric Bayesian kernel method. Therefore, our model can also adapt its complexity flexibly to the complex data distributions of the real world, without any heuristics or manual tuning of parameters. Reformulating GPs for large-scale multi-task learning is non-trivial. To simplify calculations, we introduce a more efficient equivalent form of Kernel Fisher Discriminant Analysis (KFDA) to the DGPLVM. Although the GaussianFace model can be optimized effectively using the Scaled Conjugate Gradient (SCG) technique, the inference is slow for large-scale data. We make use of GP approximations [42] and anchor graphs [35] to speed up the process of inference and prediction, so as to scale our model to large-scale data. Our model can be applied to face verification in two different ways: as a binary classifier and as a feature extractor. In the former mode, given a pair of face images, we can directly compute the posterior likelihood for each class to make a prediction. In the latter mode, our model can automatically extract high-dimensional features for each pair of face images, and then feed them to a classifier to make the final decision.

The main contributions of this paper are as follows:
• We propose a novel GaussianFace model for face verification by virtue of a multi-task learning constraint on the DGPLVM. Our model can adapt to complex distributions, avoid overfitting, exploit discriminative information, and take advantage of data from multiple source-domains.
• We introduce a computationally more efficient equivalent form of KFDA to the DGPLVM.
This equivalent form reformulates KFDA into a kernel version consistent with the covariance function in GPs, which greatly simplifies calculations.
• We introduce GP approximations and anchor graphs to speed up the process of inference and prediction.
• We achieve superior performance on the challenging LFW benchmark [23], with an accuracy rate of 98.52%, beyond the human-level performance reported in [28].

2. Related Work

Human and computer performance on face recognition has been compared extensively [40, 38, 2, 54, 41, 8]. These studies have shown that computer-based algorithms are more accurate than humans in well-controlled environments (e.g., frontal view, natural expression, and controlled illumination), while remaining comparable to humans in poor conditions (e.g., frontal view, natural expression, and uncontrolled illumination). However, the above conclusion has only been verified on face datasets with controlled variations, where only one factor changes at a time [40, 38]. To date, there has been virtually no work showing that computer-based algorithms could surpass human performance on unconstrained face datasets, such as LFW, which exhibit natural (multifactor) variations in pose, lighting, expression, race, ethnicity, age, gender, clothing, hairstyles, and other parameters.

There has been much work dealing with multifactor variations in face verification. For example, Simonyan et al. applied the Fisher vector to face verification and achieved good performance [47]. However, the Fisher vector is derived from the Gaussian mixture model (GMM), where the number of Gaussians needs to be specified by users, which means it cannot cover complex data automatically. Li et al. proposed a non-parametric subspace analysis [33, 32], but it is only a linear transformation and cannot cover complex distributions. Besides, there also exist some approaches for utilizing plentiful source-domain data.
Based on the Joint Bayesian algorithm [13], Cao et al. proposed a transfer learning approach [9] that merges source-domain data with limited target-domain data. Since this transfer learning approach is based on a joint Bayesian model of the original visual features, it is not suitable for handling complex nonlinear data or data with complex manifold structures. Moreover, the transfer learning approach in [9] only considered two different domains, restricting its wider application to large-scale data from multiple domains. More recently, Zhu et al. [63] learned the transformation from face images under various poses and lighting conditions to a canonical view with a deep convolutional network. Sun et al. [51] learned a face representation with a deep model through face identification, which is a challenging multi-class prediction task. Taigman et al. [52] first utilized explicit 3D face modeling to apply a piecewise affine transformation, and then derived a face representation from a nine-layer deep neural network. Although these methods have achieved high performance on LFW, many of their parameters must be determined in advance, so they are less flexible when dealing with complex data distributions.

The core of our algorithm is GPs. To the best of our knowledge, GP methods and multi-task learning with related GP methods (MTGP) have not been applied to face verification. Actually, MTGP/GP methods have been extensively studied in machine learning and computer vision in recent years [6, 60, 11, 25, 30, 44, 49, 61, 26]. However, most of them [60, 11, 6, 44, 25, 49, 61] have only considered symmetric multi-task learning, which assumes that all tasks are of equal importance, whereas our purpose is to enhance the performance on a target task given all other source tasks. Leen et al. proposed an MTGP model in the asymmetric setting [30] to focus on improving the performance on the target task, and Kim et al.
developed a GP model for clustering [26], but their methods do not take the discriminative information of the covariance function into special account as the DGPLVM does. Although discriminative information is considered in [57], it does not apply multi-task learning to improve its performance. Salakhutdinov et al. used a deep belief net to learn a good covariance kernel for GPs [45]. The limitation of such deep methods is that it is hard to determine which architecture of the network is optimal. Also, the multi-task learning constraint was not considered in [45].

3. Preliminary

In this section, we briefly review Gaussian Processes (GPs) for classification and clustering [26], and the Gaussian Process Latent Variable Model (GPLVM) [29]. We use the GP method mainly due to the following three notable advantages. Firstly, as mentioned previously, it is a non-parametric method, which means it adapts its complexity flexibly to the complex data distributions of the real world, without any heuristics or manual tuning of parameters. Secondly, the GP method can be computed effectively because of its closed-form marginal probability computation. Furthermore, its hyper-parameters can be learned from data automatically without using model selection methods such as cross validation, thereby avoiding the high computational cost. Thirdly, the inference of GPs is based on Bayesian rules, resulting in robustness to overfitting. We recommend Rasmussen and Williams's excellent monograph for further reading [42].

3.1. Gaussian Processes for Binary Classification

Formally, for two-class classification, suppose that we have a training set $\mathcal{D}$ of $N$ observations, $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where the $i$-th input point $\mathbf{x}_i \in \mathbb{R}^D$ and its corresponding output $y_i$ is binary, with $y_i = 1$ for one class and $y_i = -1$ for the other. Let $\mathbf{X}$ be the $N \times D$ matrix whose row vectors represent all $N$ input points, and $\mathbf{y}$ be the column vector of all $N$ outputs.
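As a concrete illustration of the machinery reviewed in this subsection, the Laplace approximation at the heart of GP binary classification can be sketched in a few lines. This is only a toy sketch, not the paper's implementation: it assumes a logistic sigmoid for $\pi(\cdot)$ and a fixed squared-exponential kernel, and the helper names (`rbf`, `laplace_mode`) are ours.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel, a stand-in for the learned covariance function.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def laplace_mode(K, y, iters=100):
    # Newton iteration for the posterior mode f_hat of p(f | X, y) under a
    # logistic link pi(f) = 1 / (1 + exp(-f)); returns f_hat and the diagonal
    # of W = -grad grad log p(y | f) evaluated at the mode.
    N = len(y)
    f = np.zeros(N)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-f))
        t = (y + 1) / 2.0                      # labels mapped from {-1, 1} to {0, 1}
        W = pi * (1.0 - pi)                    # diagonal of the negative Hessian
        sqW = np.sqrt(W)
        B = np.eye(N) + sqW[:, None] * K * sqW[None, :]
        b = W * f + (t - pi)
        a = b - sqW * np.linalg.solve(B, sqW * (K @ b))
        f = K @ a                              # updated mode estimate
    return f, W
```

The returned `f` is the posterior mode used as $\hat{\mathbf{f}}$, and `W` is the Hessian diagonal that appears in the Laplace approximation and the predictive equations below.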
We define a latent variable $f_i$ for each input point $\mathbf{x}_i$, and let $\mathbf{f} = [f_1, \ldots, f_N]^\top$. A sigmoid function $\pi(\cdot)$ is imposed to squash the output of the latent function into $[0, 1]$, $\pi(f_i) = p(y_i = 1 \mid f_i)$. Assuming the data set is i.i.d., the joint likelihood factorizes as
$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{N} p(y_i \mid f_i) = \prod_{i=1}^{N} \pi(y_i f_i). \quad (1)$$
Moreover, the posterior distribution over latent functions is
$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}, \theta) = \frac{p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid \mathbf{X})}{p(\mathbf{y} \mid \mathbf{X}, \theta)}. \quad (2)$$
Since neither $p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}, \theta)$ nor $p(\mathbf{y} \mid \mathbf{X}, \theta)$ can be computed analytically, the Laplace method is utilized to approximate the posterior:
$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}, \theta) \approx \mathcal{N}\big(\hat{\mathbf{f}}, (\mathbf{K}^{-1} + \mathbf{W})^{-1}\big), \quad (3)$$
where $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}, \theta)$ and $\mathbf{W} = -\nabla\nabla \log p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}, \theta)\big|_{\mathbf{f} = \hat{\mathbf{f}}}$. Then, we can obtain
$$\log p(\mathbf{y} \mid \mathbf{X}, \theta) = -\frac{1}{2} \hat{\mathbf{f}}^\top \mathbf{K}^{-1} \hat{\mathbf{f}} + \log p(\mathbf{y} \mid \hat{\mathbf{f}}) - \frac{1}{2} \log |\mathbf{B}|, \quad (4)$$
where $|\mathbf{B}| = |\mathbf{K}| \cdot |\mathbf{K}^{-1} + \mathbf{W}| = |\mathbf{I}_N + \mathbf{W}^{\frac{1}{2}} \mathbf{K} \mathbf{W}^{\frac{1}{2}}|$. The optimal value of $\theta$ can be acquired by using the gradient method to maximize Equation (4). Given any unseen test point $\mathbf{x}_*$, the distribution of its latent function $f_*$ is
$$f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}\big(\mathbf{K}_* \mathbf{K}^{-1} \hat{\mathbf{f}}, \; \mathbf{K}_{**} - \mathbf{K}_* \tilde{\mathbf{K}}^{-1} \mathbf{K}_*^\top\big), \quad (5)$$
where $\tilde{\mathbf{K}} = \mathbf{K} + \mathbf{W}^{-1}$. Finally, we squash $f_*$ to find the probability of class membership as follows:
$$\bar{\pi}(f_*) = \int \pi(f_*) \, p(f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_*) \, df_*. \quad (6)$$

3.2. Gaussian Processes for Clustering

The principle of GP clustering is based on the key observation that the variances of predictive values are smaller in dense areas and larger in sparse areas. The variances can be employed as a good estimate of the support of a probability density function, where each separate support domain can be considered as a cluster. This observation can be explained by the variance function of any predictive data point $\mathbf{x}_*$:
$$\sigma^2(\mathbf{x}_*) = \mathbf{K}_{**} - \mathbf{K}_* \tilde{\mathbf{K}}^{-1} \mathbf{K}_*^\top. \quad (7)$$
If $\mathbf{x}_*$ lies in a sparse region, then $\mathbf{K}_* \tilde{\mathbf{K}}^{-1} \mathbf{K}_*^\top$ becomes small, which leads to a large variance $\sigma^2(\mathbf{x}_*)$, and vice versa. Another good property of Equation (7) is that it does not depend on the labels, which means it can also be applied to unlabeled data. To perform clustering, the following dynamic system associated with Equation (7) can be written as
$$F(\mathbf{x}) = -\nabla \sigma^2(\mathbf{x}). \quad (8)$$
The theorem in [26] guarantees that almost all the trajectories approach one of the stable equilibrium points detected from Equation (8). After each data point finds its corresponding stable equilibrium point, we can employ a complete graph [4, 26] to assign cluster labels to data points via their stable equilibrium points. Obviously, the variance function in Equation (7) completely determines the performance of clustering.

3.3. Gaussian Process Latent Variable Model

Let $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_N]^\top$ denote the matrix whose rows represent the corresponding positions of $\mathbf{X}$ in latent space, where $\mathbf{z}_i \in \mathbb{R}^d$ ($d \ll D$). The Gaussian Process Latent Variable Model (GPLVM) can be interpreted as a Gaussian process mapping from a low-dimensional latent space to a high-dimensional data set, where the locations of the points in latent space are determined by maximizing the Gaussian process likelihood with respect to $\mathbf{Z}$. Given a covariance function for the Gaussian process, denoted by $k(\cdot, \cdot)$, the likelihood of the data given the latent positions is
$$p(\mathbf{X} \mid \mathbf{Z}, \theta) = \frac{1}{\sqrt{(2\pi)^{ND} |\mathbf{K}|^{D}}} \exp\left( -\frac{1}{2} \mathrm{tr}(\mathbf{K}^{-1} \mathbf{X} \mathbf{X}^\top) \right), \quad (9)$$
where $\mathbf{K}_{i,j} = k(\mathbf{z}_i, \mathbf{z}_j)$. Therefore, the posterior can be written as
$$p(\mathbf{Z}, \theta \mid \mathbf{X}) = \frac{1}{Z_a} p(\mathbf{X} \mid \mathbf{Z}, \theta) \, p(\mathbf{Z}) \, p(\theta), \quad (10)$$
where $Z_a$ is a normalization constant, and uninformative priors over $\theta$ and simple spherical Gaussian priors over $\mathbf{Z}$ are introduced [57]. To obtain the optimal $\theta$ and $\mathbf{Z}$, we need to optimize the likelihood (10) with respect to $\theta$ and $\mathbf{Z}$, respectively.
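The clustering recipe above is easy to reproduce in miniature. The sketch below computes the predictive variance of Equation (7) and follows the dynamic system of Equation (8) by finite-difference gradient descent; it is an illustration under simplifying assumptions (a fixed RBF kernel, and $\tilde{\mathbf{K}} = \mathbf{K} + \mathbf{W}^{-1}$ replaced by $\mathbf{K} + \nu\mathbf{I}$ for a small constant $\nu$), not the paper's implementation.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel used for illustration.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def predictive_variance(Xtrain, Xstar, ell=1.0, noise=1e-2):
    # sigma^2(x*) = K** - K* Ktilde^{-1} K*^T  (Eq. 7), with Ktilde
    # approximated by K + noise * I for this sketch.
    Kt = rbf(Xtrain, Xtrain, ell) + noise * np.eye(len(Xtrain))
    Ks = rbf(Xstar, Xtrain, ell)
    Kss = rbf(Xstar, Xstar, ell)
    return np.diag(Kss - Ks @ np.linalg.solve(Kt, Ks.T))

def descend(Xtrain, x0, step=0.1, iters=200, eps=1e-4):
    # Follow F(x) = -grad sigma^2(x) (Eq. 8) via a numerical gradient,
    # moving x toward a stable equilibrium point in a dense region.
    x = x0.copy()
    for _ in range(iters):
        g = np.zeros_like(x)
        for d in range(len(x)):
            e = np.zeros_like(x); e[d] = eps
            g[d] = (predictive_variance(Xtrain, (x + e)[None, :])[0]
                    - predictive_variance(Xtrain, (x - e)[None, :])[0]) / (2 * eps)
        x -= step * g
    return x
```

Starting points descend toward dense regions, so points that end at the same equilibrium can be grouped into one cluster, as described above.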
4. GaussianFace

In order to automatically learn discriminative features or covariance functions, and to take advantage of source-domain data to improve the performance in face verification, we develop a principled GaussianFace model by incorporating the multi-task learning constraint into the Discriminative Gaussian Process Latent Variable Model (DGPLVM) [57].

4.1. DGPLVM Reformulation

The DGPLVM is an extension of the GPLVM, where a discriminative prior is placed over the latent positions, rather than a simple spherical Gaussian prior. The DGPLVM uses the discriminative prior to encourage latent positions of the same class to be close and those of different classes to be far apart. Since face verification is a binary classification problem and GPs mainly depend on the kernel function, it is natural to use Kernel Fisher Discriminant Analysis (KFDA) [27] to model class structures in kernel spaces. For simplicity of the inference that follows, we introduce another equivalent formulation of KFDA to replace the one in [57].

KFDA is a kernelized version of the linear discriminant analysis method. It finds the direction, defined by a kernel in a feature space, onto which the projections of the positive and negative classes are well separated by maximizing the ratio of the between-class variance to the within-class variance. Formally, let $\{\mathbf{z}_1, \ldots, \mathbf{z}_{N_+}\}$ denote the positive class and $\{\mathbf{z}_{N_+ + 1}, \ldots, \mathbf{z}_N\}$ the negative class, where the sizes of the positive and negative classes are $N_+$ and $N_- = N - N_+$, respectively. Let $\mathbf{K}$ be the kernel matrix. Therefore, in the feature space, the two sets $\{\phi_K(\mathbf{z}_1), \ldots, \phi_K(\mathbf{z}_{N_+})\}$ and $\{\phi_K(\mathbf{z}_{N_+ + 1}), \ldots, \phi_K(\mathbf{z}_N)\}$ represent the positive class and the negative class, respectively.
The optimization criterion of KFD A is to maximize the ratio of the between- class variance to the within-class v ariance J ( ω , K ) = ( w > ( µ + K − µ − K )) 2 w > ( Σ + K + Σ − K + λ I N ) w , (11) where λ is a positiv e regularization parameter , µ + K = 1 N + P N + i =1 φ K ( z i ) , µ − K = 1 N − P N i = N + +1 φ K ( z i ) , Σ + K = 1 N + P N + i =1 ( φ K ( z i ) − µ + K )( φ K ( z i ) − µ + K ) > , and Σ − K = 1 N − P N i = N + +1 ( φ K ( z i ) − µ − K )( φ K ( z i ) − µ − K ) > . In this paper, howe ver , we focus on the cov ariance function rather than the latent positions. T o simplify calculations, we represent Equation (11) with the kernel function, and let the kernel function ha ve the same form as the cov ariance function. Therefore, it is natural to introduce a more efficient equi v alent form of KFDA with certain assumptions as Kim et al. points out [27], i.e., maximizing Equation (11) is equiv alent to maximizing the follo wing 4 equation J ∗ = 1 λ a > Ka − a > KA ( λ I n + AKA ) − 1 AKa , (12) where a =[ 1 n + 1 > N + , − 1 N − 1 > N − ] A = diag 1 p N + I N + − 1 N + 1 N + 1 > N + , 1 p N − I N − − 1 N − 1 N − 1 > N − . Here, I N denotes the N × N identity matrix and 1 N denotes the length- N vector of all ones in R N . Therefore, the discriminativ e prior over the latent posi- tions in DGPL VM can be written as p ( Z ) = 1 Z b exp − 1 σ 2 J ∗ , (13) where Z b is a normalization constant, and σ 2 represents a global scaling of the prior . The cov ariance matrix obtained by DGPL VM is discrim- inativ e and more flexible than the one used in conv entional GPs for classification (GPC), since they are learned based on a discriminative criterion, and more degrees of freedom are estimated than con ventional kernel hyper -parameters. 4.2. Multi-task Learning Constraint From an asymmetric multi-task learning perspecti ve, the tasks should be allowed to share common hyper -parameters of the covariance function. 
Moreover, from an information theory perspective, the information cost between the target task and the multiple source tasks should be minimized. A natural way to quantify the information cost is to use mutual entropy, because it is a measure of the mutual dependence of two distributions. For multi-task learning, we extend the mutual entropy to multiple distributions as follows:
$$\mathcal{M} = H(p_t) - \frac{1}{S} \sum_{i=1}^{S} H(p_t \mid p_i), \quad (14)$$
where $H(\cdot)$ is the marginal entropy, $H(\cdot \mid \cdot)$ is the conditional entropy, $S$ is the number of source tasks, and $\{p_i\}_{i=1}^{S}$ and $p_t$ are the probability distributions of the source tasks and the target task, respectively.

4.3. GaussianFace Model

In this section, we describe our GaussianFace model in detail. Suppose we have $S$ source-domain datasets $\{\mathbf{X}_1, \ldots, \mathbf{X}_S\}$ and a target-domain dataset $\mathbf{X}_T$. For each source-domain or target-domain dataset $\mathbf{X}_i$, according to Equation (9), we write its marginal likelihood
$$p(\mathbf{X}_i \mid \mathbf{Z}_i, \theta) = \frac{1}{\sqrt{(2\pi)^{ND} |\mathbf{K}|^{D}}} \exp\left( -\frac{1}{2} \mathrm{tr}(\mathbf{K}^{-1} \mathbf{X}_i \mathbf{X}_i^\top) \right), \quad (15)$$
where $\mathbf{Z}_i$ represents the domain-relevant latent space. For each source-domain dataset and the target-domain dataset, the covariance functions $\mathbf{K}$ have the same form because they share the same hyper-parameters $\theta$. In this paper, we use the widely used kernel
$$\mathbf{K}_{i,j} = k_\theta(\mathbf{z}_i, \mathbf{z}_j) = \theta_0 \exp\left( -\frac{1}{2} \sum_{m=1}^{d} \theta_m (z_i^m - z_j^m)^2 \right) + \theta_{d+1} + \delta_{\mathbf{z}_i, \mathbf{z}_j} \theta_{d+2}, \quad (16)$$
where $\theta = \{\theta_i\}_{i=0}^{d+2}$ and $d$ is the dimension of the data points. Then, from Equation (10), learning the DGPLVM is equivalent to optimizing
$$p(\mathbf{Z}_i, \theta \mid \mathbf{X}_i) = \frac{1}{Z_a} p(\mathbf{X}_i \mid \mathbf{Z}_i, \theta) \, p(\mathbf{Z}_i) \, p(\theta), \quad (17)$$
where $p(\mathbf{X}_i \mid \mathbf{Z}_i, \theta)$ and $p(\mathbf{Z}_i)$ are given in (15) and (13), respectively. According to the multi-task learning constraint in Equation (14), we obtain
$$\mathcal{M} = H\big(p(\mathbf{Z}_T, \theta \mid \mathbf{X}_T)\big) - \frac{1}{S} \sum_{i=1}^{S} H\big(p(\mathbf{Z}_T, \theta \mid \mathbf{X}_T) \,\big|\, p(\mathbf{Z}_i, \theta \mid \mathbf{X}_i)\big). \quad (18)$$
From Equations (15), (17), and (18), we know that learning the GaussianFace model amounts to minimizing the following marginal likelihood:
$$\mathcal{L}_{Model} = -\log p(\mathbf{Z}_T, \theta \mid \mathbf{X}_T) - \beta \mathcal{M}, \quad (19)$$
where the parameter $\beta$ balances the relative importance between the target-domain data and the multi-task learning constraint.

4.4. Optimization

For the model optimization, we first expand Equation (19) to obtain the following equation (ignoring the constant terms):
$$\mathcal{L}_{Model} = -\log P_T + \beta P_T \log P_T + \frac{\beta}{S} \sum_{i=1}^{S} \big( P_{T,i} \log P_i - P_{T,i} \log P_{T,i} \big), \quad (20)$$
where $P_i = p(\mathbf{Z}_i, \theta \mid \mathbf{X}_i)$ and $P_{i,j}$ means that its corresponding covariance function is computed on both $\mathbf{X}_i$ and $\mathbf{X}_j$. We can now optimize Equation (20) with respect to the hyper-parameters $\theta$ and the latent positions $\mathbf{Z}_i$ by the Scaled Conjugate Gradient (SCG) technique. Since we focus on the covariance matrix in this paper, here we only present the derivations for the hyper-parameters. It is easy to get
$$\frac{\partial \mathcal{L}_{Model}}{\partial \theta_j} = \left( \beta (\log P_T + 1) - \frac{1}{P_T} \right) \frac{\partial P_T}{\partial \theta_j} + \frac{\beta}{S} \sum_{i=1}^{S} \frac{P_{T,i}}{P_i} \frac{\partial P_i}{\partial \theta_j} + \frac{\beta}{S} \sum_{i=1}^{S} (\log P_i - \log P_{T,i} - 1) \frac{\partial P_{T,i}}{\partial \theta_j}.$$
The above equation depends on terms of the form $\frac{\partial P_i}{\partial \theta_j}$, as follows (ignoring the constant terms):
$$\frac{\partial P_i}{\partial \theta_j} = P_i \frac{\partial \log P_i}{\partial \theta_j} \approx P_i \left( \frac{\partial \log p(\mathbf{X}_i \mid \mathbf{Z}_i, \theta)}{\partial \theta_j} + \frac{\partial \log p(\mathbf{Z}_i)}{\partial \theta_j} + \frac{\partial \log p(\theta)}{\partial \theta_j} \right).$$
The above three terms can be easily obtained (ignoring the constant terms) by
$$\frac{\partial \log p(\mathbf{X}_i \mid \mathbf{Z}_i, \theta)}{\partial \theta_j} \approx -\frac{D}{2} \mathrm{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j} \right) + \frac{1}{2} \mathrm{tr}\left( \mathbf{K}^{-1} \mathbf{X}_i \mathbf{X}_i^\top \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j} \right),$$
$$\frac{\partial \log p(\mathbf{Z}_i)}{\partial \theta_j} \approx -\frac{1}{\sigma^2} \frac{\partial J_i^*}{\partial \theta_j} = -\frac{1}{\lambda \sigma^2} \left( \mathbf{a}^\top \frac{\partial \mathbf{K}}{\partial \theta_j} \mathbf{a} - \mathbf{a}^\top \frac{\partial \mathbf{K}}{\partial \theta_j} \tilde{\mathbf{A}} \mathbf{K} \mathbf{a} + \mathbf{a}^\top \mathbf{K} \tilde{\mathbf{A}} \frac{\partial \mathbf{K}}{\partial \theta_j} \tilde{\mathbf{A}} \mathbf{K} \mathbf{a} - \mathbf{a}^\top \mathbf{K} \tilde{\mathbf{A}} \frac{\partial \mathbf{K}}{\partial \theta_j} \mathbf{a} \right),$$
$$\frac{\partial \log p(\theta)}{\partial \theta_j} = \frac{1}{\theta_j},$$
where $\tilde{\mathbf{A}} = \mathbf{A} (\lambda \mathbf{I}_N + \mathbf{A} \mathbf{K} \mathbf{A})^{-1} \mathbf{A}$. Thus, the desired derivatives have been obtained.

4.5. Speedup

In the GaussianFace model, we need to invert a large matrix when doing inference and prediction.
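The data-likelihood derivative of Section 4.4 is straightforward to sanity-check numerically. The sketch below implements $\log p(\mathbf{X} \mid \mathbf{Z}, \theta)$ of Equation (15) up to additive constants, together with its gradient with respect to $\mathbf{K}$ (contracting this gradient with $\partial \mathbf{K}/\partial \theta_j$ yields the first derivative term above); the function names are ours, and the check uses a small hand-picked positive-definite matrix:

```python
import numpy as np

def gplvm_loglik(K, X):
    # log p(X | Z, theta) of Eq. (15), up to additive constants:
    # -D/2 log|K| - 1/2 tr(K^{-1} X X^T)
    D = X.shape[1]
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * D * logdet - 0.5 * np.trace(np.linalg.solve(K, X @ X.T))

def gplvm_grad_K(K, X):
    # d log p / dK = -D/2 K^{-1} + 1/2 K^{-1} X X^T K^{-1}; contracting with
    # dK/dtheta_j reproduces the trace expressions of Section 4.4.
    D = X.shape[1]
    Ki = np.linalg.inv(K)
    return -0.5 * D * Ki + 0.5 * Ki @ (X @ X.T) @ Ki
```

A finite-difference comparison along any symmetric direction $\mathbf{E}$ confirms the two formulas agree.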
For large problems, both storing the matrix and solving the associated linear systems are computationally prohibitive. In this paper, we use the anchor graphs method [35] to speed up this process. To put it simply, we first select $q$ ($q \ll n$) anchors to cover a cloud of $n$ data points, and form an $n \times q$ matrix $\mathbf{Q}$, where $\mathbf{Q}_{i,j} = k_\theta(\mathbf{z}_i, \mathbf{z}_j)$; here $\mathbf{z}_i$ and $\mathbf{z}_j$ are taken from the $n$ latent data points and the $q$ anchors, respectively. The original kernel matrix $\mathbf{K}$ can then be approximated as $\mathbf{K} \approx \mathbf{Q}\mathbf{Q}^\top$. Using the Woodbury identity [21], computations involving the $n \times n$ matrix $\mathbf{Q}\mathbf{Q}^\top$ can be transformed into computations involving the $q \times q$ matrix $\mathbf{Q}^\top\mathbf{Q}$, which is much more efficient.

Speedup on Inference. When optimizing Equation (19), we need to invert the matrix $(\lambda \mathbf{I}_n + \mathbf{A}\mathbf{K}\mathbf{A})$. During inference, we take $q$ k-means clustering centers as anchors to form $\mathbf{Q}$. Substituting $\mathbf{K} \approx \mathbf{Q}\mathbf{Q}^\top$ into $(\lambda \mathbf{I}_n + \mathbf{A}\mathbf{K}\mathbf{A})$, and then using the Woodbury identity, we get
$$(\lambda \mathbf{I}_n + \mathbf{A}\mathbf{K}\mathbf{A})^{-1} \approx (\lambda \mathbf{I}_n + \mathbf{A}\mathbf{Q}\mathbf{Q}^\top\mathbf{A})^{-1} = \lambda^{-1}\mathbf{I}_n - \lambda^{-1}\mathbf{A}\mathbf{Q} (\lambda \mathbf{I}_q + \mathbf{Q}^\top\mathbf{A}\mathbf{A}\mathbf{Q})^{-1} \mathbf{Q}^\top\mathbf{A}.$$
Similarly, let $\mathbf{K}^{-1} \approx (\mathbf{K} + \tau\mathbf{I})^{-1}$, where $\tau$ is a small constant; then we get
$$\mathbf{K}^{-1} \approx (\mathbf{K} + \tau\mathbf{I})^{-1} \approx \tau^{-1}\mathbf{I}_n - \tau^{-1}\mathbf{Q} (\tau \mathbf{I}_q + \mathbf{Q}^\top\mathbf{Q})^{-1} \mathbf{Q}^\top.$$

Speedup on Prediction. When we compute the predictive variance $\sigma^2(\mathbf{z}_*)$, we need to invert the matrix $(\mathbf{K} + \mathbf{W}^{-1})$. At this time, we can use the method in Section 3.2 to calculate accurate clustering centers that can be regarded as the anchors. Using the Woodbury identity again, we obtain
$$(\mathbf{K} + \mathbf{W}^{-1})^{-1} \approx \mathbf{W} - \mathbf{W}\mathbf{Q} (\mathbf{I}_q + \mathbf{Q}^\top\mathbf{W}\mathbf{Q})^{-1} \mathbf{Q}^\top\mathbf{W},$$
where $(\mathbf{I}_q + \mathbf{Q}^\top\mathbf{W}\mathbf{Q})$ is only a $q \times q$ matrix, whose inverse can be computed efficiently.

5. GaussianFace Model for Face Verification

In this section, we describe two applications of the GaussianFace model to face verification: as a binary classifier and as a feature extractor. Each face image is first normalized to $150 \times 120$ pixels by an affine transformation based on five landmarks (two eyes, nose, and two mouth corners).
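The Woodbury manipulations of Section 4.5 above are worth verifying in code. The sketch below implements the low-rank identity $(\mathbf{Q}\mathbf{Q}^\top + \tau\mathbf{I}_n)^{-1} = \tau^{-1}\mathbf{I}_n - \tau^{-1}\mathbf{Q}(\tau\mathbf{I}_q + \mathbf{Q}^\top\mathbf{Q})^{-1}\mathbf{Q}^\top$, so that only a $q \times q$ system is ever solved; the helper name is ours, not the paper's code:

```python
import numpy as np

def woodbury_inv(Q, tau):
    # (Q Q^T + tau I_n)^{-1} via the Woodbury identity: solve a q x q
    # system instead of inverting an n x n matrix.
    n, q = Q.shape
    inner = np.linalg.solve(tau * np.eye(q) + Q.T @ Q, Q.T)
    return (np.eye(n) - Q @ inner) / tau
```

For $q \ll n$ this reduces the dominant cost from $O(n^3)$ to roughly $O(nq^2)$, which is the point of the anchor-graph approximation.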
The image is then divided into overlapping patches of $25 \times 25$ pixels with a stride of 2 pixels. Each patch within the image is mapped to a vector by a certain descriptor, and the vector is regarded as the feature of the patch, denoted by $\{\mathbf{x}_p^A\}_{p=1}^{P}$, where $P$ is the number of patches within the face image $A$. In this paper, the multi-scale LBP feature of each patch is extracted [14]. The difference is that the multi-scale LBP descriptors are extracted at the center of each patch instead of at accurate landmarks.

5.1. GaussianFace Model as a Binary Classifier

For classification, our model can be regarded as an approach to learn a covariance function for GPC, as shown in Figure 1 (a). Here, for a pair of face images $A$ and $B$ from the same (or different) person, let the similarity vector $\mathbf{x}_i = [s_1, \ldots, s_p, \ldots, s_P]^\top$ be the input data point of the GaussianFace model, where $s_p$ is the similarity of $\mathbf{x}_p^A$ and $\mathbf{x}_p^B$, and its corresponding output is $y_i = 1$ (or $-1$). With the hyper-parameters of the covariance function learned from the training data, given any unseen pair of face images, we first compute its similarity vector $\mathbf{x}_*$ using the above method, then estimate its latent representation $\mathbf{z}_*$ using the same method as in [57], and finally predict whether the pair is from the same person through Equation (6). In this paper, we prescribe the sigmoid function $\pi(\cdot)$ to be the cumulative Gaussian distribution $\Phi(\cdot)$, for which Equation (6) can be solved analytically as $\bar{\pi}_* = \Phi\big(\bar{f}_*(\mathbf{z}_*)/\sqrt{1 + \sigma^2(\mathbf{z}_*)}\big)$, where $\sigma^2(\mathbf{z}_*) = \mathbf{K}_{**} - \mathbf{K}_* \tilde{\mathbf{K}}^{-1} \mathbf{K}_*^\top$ and $\bar{f}_*(\mathbf{z}_*) = \mathbf{K}_* \mathbf{K}^{-1} \hat{\mathbf{f}}$ from Equation (5) [42]. We call this method GaussianFace-BC.

5.2. GaussianFace Model as a Feature Extractor

As a feature extractor, our model can be regarded as an approach to automatically extract facial features, as shown in Figure 1 (b).
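The closed-form GaussianFace-BC prediction rule of Section 5.1 is one line of code. The sketch below evaluates $\bar{\pi}_* = \Phi\big(\bar{f}_*/\sqrt{1 + \sigma^2}\big)$ via the error function; the helper name is ours:

```python
import math

def probit_predict(f_mean, f_var):
    # pi_bar = Phi(f_mean / sqrt(1 + f_var)), where Phi is the cumulative
    # Gaussian; Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    return 0.5 * (1.0 + math.erf(f_mean / math.sqrt(2.0 * (1.0 + f_var))))
```

Note how a larger predictive variance pulls the probability toward 0.5, reflecting the model's uncertainty about the pair.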
Here, for a pair of face images $A$ and $B$ from the same (or different) person, we regard the joint feature vector $\mathbf{x}_i = [(\mathbf{x}_i^A)^\top, (\mathbf{x}_i^B)^\top]^\top$ as the input data point of the GaussianFace model, and its corresponding output is $y_i = 1$ (or $-1$). To enhance the robustness of our approach, the flipped form of $\mathbf{x}_i$, namely $\mathbf{x}_i = [(\mathbf{x}_i^B)^\top, (\mathbf{x}_i^A)^\top]^\top$, is also included. After the hyper-parameters of the covariance function are learned from the training data, we first estimate the latent representations of the training data using the same method as in [57], and then use the method in Section 3.2 to group the latent data points into different clusters automatically. Suppose that we finally obtain $C$ clusters. The centers of these clusters are denoted by $\{\mathbf{c}_i\}_{i=1}^{C}$, the variances of these clusters by $\{\Sigma_i^2\}_{i=1}^{C}$, and their weights by $\{w_i\}_{i=1}^{C}$, where $w_i$ is the ratio of the number of latent data points in the $i$-th cluster to the total number of latent data points. We then take each $\mathbf{c}_i$ as the input of Equation (5), and obtain its corresponding probability $p_i$ and variance $\sigma_i^2$. In fact, $\{\mathbf{c}_i\}_{i=1}^{C}$ can be regarded as a codebook generated by our model.

For any unseen pair of face images, we also first compute the joint feature vector $\mathbf{x}_*$ for each pair of patches, and estimate its latent representation $\mathbf{z}_*$. Then we compute its first-order and second-order statistics with respect to the centers. Similarly, we regard $\mathbf{z}_*$ as the input of Equation (5), and obtain its corresponding probability $p_*$ and variance $\sigma_*^2$. The statistics and variance of $\mathbf{z}_*$ are represented as its high-dimensional facial features, denoted by $\hat{\mathbf{z}}_* = [\Delta_1^1, \Delta_1^2, \Delta_1^3, \Delta_1^4, \ldots, \Delta_C^1, \Delta_C^2, \Delta_C^3, \Delta_C^4]^\top$, where
$$\Delta_i^1 = w_i \frac{\mathbf{z}_* - \mathbf{c}_i}{\Sigma_i}, \quad \Delta_i^2 = w_i \left( \frac{\mathbf{z}_* - \mathbf{c}_i}{\Sigma_i} \right)^2, \quad \Delta_i^3 = \log \frac{p_* (1 - p_i)}{p_i (1 - p_*)}, \quad \Delta_i^4 = \frac{\sigma_*^2}{\sigma_i^2}.$$
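The construction of $\hat{\mathbf{z}}_*$ can be sketched directly from the definitions above. This is an illustration (the helper name and argument layout are ours), assuming per-dimension cluster deviations $\Sigma_i$:

```python
import numpy as np

def high_dim_feature(z, p_star, var_star, centers, sigmas, weights, probs, cvars):
    # Concatenate, for each cluster i:
    #   Delta1 = w_i (z - c_i) / Sigma_i          (first-order statistic)
    #   Delta2 = w_i ((z - c_i) / Sigma_i)^2      (second-order statistic)
    #   Delta3 = log[p*(1 - p_i) / (p_i (1 - p*))]  (label log-odds ratio)
    #   Delta4 = var* / var_i                     (variance ratio)
    feats = []
    for c_i, s_i, w_i, p_i, v_i in zip(centers, sigmas, weights, probs, cvars):
        r = (z - c_i) / s_i
        feats.append(w_i * r)
        feats.append(w_i * r ** 2)
        feats.append(np.array([np.log(p_star * (1 - p_i) / (p_i * (1 - p_star)))]))
        feats.append(np.array([var_star / v_i]))
    return np.concatenate(feats)
```

With $C$ clusters and $d$-dimensional latent points, the feature for one pair of patches has length $C(2d + 2)$.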
We then concatenate all of the new high-dimensional features from each pair of patches to form the final high-dimensional feature for the pair of face images. The new high-dimensional facial features not only describe how the distribution of features of an unseen face image differs from the distribution fitted to the features of all training images, but also encode the predictive information, including the probabilities of labels and uncertainty. We call this approach GaussianFace-FE.

Figure 1. Two approaches based on the GaussianFace model for face verification. (a) GaussianFace model as a binary classifier. (b) GaussianFace model as a feature extractor.

6. Experimental Settings

In this section, we conduct experiments on face verification. We start by introducing the source-domain datasets and the target-domain dataset used in all of our experiments (see Figure 2 for examples). The source-domain datasets include four different types of datasets, as follows:

Multi-PIE [19]. This dataset contains face images from 337 subjects under 15 viewpoints and 19 illumination conditions in four recording sessions. These images are collected under controlled conditions.

MORPH [43]. The MORPH database contains 55,000 images of more than 13,000 people within the age range of 16 to 77. There is an average of 4 images per individual.

Web Images. This dataset contains around 40,000 facial images of 3,261 subjects; that is, approximately 10 images for each person. The images were collected from the Web with significant variations in pose, expression, and illumination conditions.

Life Photos.
This dataset contains approximately 5,000 images of 400 subjects collected online. Each subject has roughly 10 images.

²These two datasets were collected by ourselves from the Web. It is guaranteed that they are mutually exclusive with the LFW dataset.

Figure 2. Samples of the datasets in our experiments. From left to right: LFW, Multi-PIE, MORPH, Web Images, and Life Photos.

If not otherwise specified, the target-domain dataset is the following face verification benchmark:

LFW [23]. This dataset contains 13,233 uncontrolled face images of 5,749 public figures with variations in pose, lighting, expression, race, ethnicity, age, gender, clothing, hairstyles, and other parameters. All of these images are collected from the Web. We use the LFW dataset as the target-domain dataset because it is well known as a challenging benchmark. Using it also allows us to compare directly with other existing face verification methods [9, 5, 14, 47, 13, 59, 1, 20, 16]. Besides, this dataset provides a large set of relatively unconstrained face images with the complex variations described above, and has proven difficult for automatic face verification methods [23, 28].

In all the experiments conducted on LFW, we strictly follow the standard unrestricted protocol of LFW [23]. More precisely, during the training procedure, the four source-domain datasets are Web Images, Multi-PIE, MORPH, and Life Photos; the target-domain dataset is the training set in View 1 of LFW; and the validation set is the test set in View 1 of LFW. At test time, we follow the standard 10-fold cross-validation protocol to test our model on View 2 of LFW. For each of the four source-domain datasets, we randomly sample 20,000 pairs of matched images and 20,000 pairs of mismatched images. The training partition and the testing partition in all of our experiments are mutually exclusive; in other words, there is no identity overlap between the two partitions.
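The identity-disjoint pair sampling described above can be sketched as follows. The record format, function name, and parameters are illustrative assumptions, not the authors' code; the point is that identities, not images, are split, so no identity appears in both partitions.

```python
import random
from collections import defaultdict

def sample_identity_disjoint_pairs(records, n_train_ids, n_pairs, seed=0):
    """Split *identities* (not images) into train/test, then draw matched
    and mismatched pairs from the train identities only, so the two
    partitions share no identity. `records` is a list of (image, identity)."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for img, ident in records:
        by_id[ident].append(img)
    ids = sorted(by_id)
    rng.shuffle(ids)
    train_ids = sorted(ids[:n_train_ids])
    test_ids = set(ids[n_train_ids:])
    # Matched pairs need identities with at least two images.
    multi = [i for i in train_ids if len(by_id[i]) >= 2]

    matched = [tuple(rng.sample(by_id[rng.choice(multi)], 2))
               for _ in range(n_pairs)]
    mismatched = []
    for _ in range(n_pairs):
        a, b = rng.sample(train_ids, 2)
        mismatched.append((rng.choice(by_id[a]), rng.choice(by_id[b])))
    return matched, mismatched, test_ids
```

The held-out `test_ids` would then supply the evaluation pairs, mirroring the mutually exclusive partitions used in the experiments.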
For the experiments below, "the Number of SD" means "the number of source-domain datasets that are fed into the GaussianFace model for training". By parity of reasoning, if the Number of SD is i, the first i source-domain datasets are used for model training; if the Number of SD is 0, models are trained with training data from the target domain only.

Implementation details. Our model involves four important parameters: λ in (12), σ in (13), β in (19), and the number of anchors q in Speedup on Inference.³ Following the same setting as in [27], the regularization parameter λ in (12) is fixed to 10⁻⁸. σ reflects the trade-off between our method's ability to discriminate (small σ) and its ability to generalize (large σ), and β balances the relative importance of the target-domain data and the multi-task learning constraint. Therefore, the validation set (the test set in View 1 of LFW) is used for selecting σ and β. Each time a different number of source-domain datasets is used for training, the corresponding optimal σ and β are selected on the validation set. Since we collected a large number of image pairs for training (20,000 matched pairs and 20,000 mismatched pairs from each source-domain dataset), and our model is based on the kernel method, an important consideration is how to efficiently approximate the kernel matrix using a low-rank method within limited space and time. We adopt the anchor graphs method (see Section 4.5) for kernel approximation. In our experiments, we take two steps to determine the number of anchor points. In the first step, the optimal σ and β are selected on the validation set in each experiment. In the second step, we fix σ and β, and then tune the number of anchor points: we vary the number of anchor points, train our model on the training set, and test it on the validation set.

³The other parameters, such as the hyper-parameters in the kernel function and the number of anchors in Speedup on Prediction, can be automatically learned from the data.
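To make the low-rank step concrete, here is a generic Nyström-style sketch of kernel-matrix approximation with anchor points. It is a stand-in that assumes a plain RBF kernel and randomly chosen anchors, not the paper's anchor-graph construction of Section 4.5:

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    # RBF kernel matrix between the rows of X and the rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_kernel(X, n_anchors, gamma=0.5, seed=0):
    """Low-rank approximation K ~= C W^+ C^T, where C is the n x m
    cross-kernel to m anchors and W is the m x m anchor kernel.
    Storage drops from O(n^2) for K to O(nm) for C."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=n_anchors, replace=False)]
    C = rbf(X, anchors, gamma)
    W = rbf(anchors, anchors, gamma)
    return C @ np.linalg.pinv(W) @ C.T
```

When every point is its own anchor the approximation is exact (since K K⁺ K = K); in practice m is much smaller than n, trading accuracy for memory, which mirrors the anchor-point tuning described above.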
We report the average accuracy for our model over 10 trials. After considering the trade-off between memory and running time in practice, the number of anchor points with the best average accuracy is determined for each experiment.

7. Experimental Results

In this section, we conduct five experiments to demonstrate the validity of the GaussianFace model.

7.1. Comparisons with Other MTGP/GP Methods

Since our model is based on GPs, it is natural to compare it with four popular GP models: GPC [42], MTGP prediction [6], GPLVM [29], and DGPLVM [57]. For fair comparison, all these models are trained on multiple source-domain datasets using the same two methods as our GaussianFace model, described in Section 5. After the hyper-parameters of the covariance function are learnt for each model, we can regard each model as a binary classifier and as a feature extractor like ours, respectively. Figure 3 shows that our model significantly outperforms the other four GP models, and the superiority of our model becomes more obvious as the number of source-domain datasets increases.

Figure 3. (a) The accuracy rate (%) of the GaussianFace-BC model and other competing MTGP/GP methods as a binary classifier. (b) The accuracy rate (%) of the GaussianFace-FE model and other competing MTGP/GP methods as a feature extractor. (c) The relative improvement of each method as a binary classifier with the increasing number of SD, compared to its performance when the number of SD is 0. (d) The relative improvement of each method as a feature extractor with the increasing number of SD, compared to its performance when the number of SD is 0.

7.2. Comparisons with Other Binary Classifiers

Since our model can be regarded as a binary classifier, we have also compared our method with other classical binary classifiers. For this paper, we chose three popular representatives: SVM [12], logistic regression (LR) [17], and AdaBoost [18]. Table 1 shows that the performance of our GaussianFace-BC method is much better than those of the other classifiers. Furthermore, these experimental results demonstrate the effectiveness of the multi-task learning constraint: our GaussianFace-BC achieves about 7.5% improvement when all four source-domain datasets are used for training, while the best of the other three binary classifiers achieves only around 4% improvement.

The Number of SD     0      1      2      3      4
SVM [12]         83.21  84.32  85.06  86.43  87.31
LR [17]          81.14  81.92  82.65  83.84  84.75
AdaBoost [18]    82.91  83.62  84.80  86.30  87.21
GaussianFace-BC  86.25  88.24  90.01  92.22  93.73

Table 1. The accuracy rate (%) of our method as a binary classifier and other competing methods on LFW, using an increasing number of source-domain datasets.

7.3. Comparisons with Other Feature Extractors

Our model can also be regarded as a feature extractor, which is implemented by clustering to generate a codebook. Therefore, we evaluate our method by comparing it with three popular clustering methods: K-means [24], Random Projection (RP) tree [15], and Gaussian Mixture Model (GMM) [47]. Since our method can determine the number of clusters automatically, for fair comparison, all the other methods generate the same number of clusters as ours.
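As a concrete example of the clustering baselines compared here, a minimal K-means (Lloyd's algorithm) codebook builder might look as follows. This is an illustrative stand-in with deterministic farthest-point initialization, not the tuned implementation used in the experiments:

```python
import numpy as np

def kmeans_codebook(X, k, n_iter=50):
    """Minimal Lloyd's algorithm returning k cluster centers (the
    'codebook') and the cluster label of each row of X."""
    # Deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([np.square(X - c).sum(1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers, labels
```

The other baselines (RP tree, GMM) would plug into the same codebook slot, with k fixed to the number of clusters the GaussianFace model finds automatically.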
As shown in Table 2, our GaussianFace-FE method significantly outperforms all of the compared approaches, which verifies the effectiveness of our method as a feature extractor.

The Number of SD     0      1      2      3      4
K-means [24]     84.71  85.20  85.74  86.81  87.68
RP Tree [15]     85.11  85.70  86.45  87.52  88.34
GMM [47]         86.63  87.02  87.58  88.60  89.21
GaussianFace-FE  89.33  91.04  93.31  95.62  97.79

Table 2. The accuracy rate (%) of our method as a feature extractor and other competing methods on LFW, using an increasing number of source-domain datasets.

The results also show that the multi-task learning constraint is effective: each time a different type of source-domain dataset is added for training, the performance improves significantly. Our GaussianFace-FE model achieves over 8% improvement when the number of SD varies from 0 to 4, much higher than the roughly 3% improvement of the other methods.

Figure 4. The ROC curve on LFW (true positive rate vs. false positive rate). Our method achieves the best performance, beating human-level performance. Legend: Tom-vs-Pete + Attribute (93.30%) [5]; High-dimensional LBP (95.17%) [14]; Fisher Vector Faces (93.03%) [47]; combined Joint Bayesian (92.42%) [13]; Associate-Predict (90.57%) [59]; TL Joint Bayesian (96.33%) [9]; VisionLabs (92.90%) [1]; Aurora (93.24%) [20]; Face++ (97.27%) [16]; Human, cropped (97.53%) [28]; DeepFace-ensemble (97.35%) [53]; GaussianFace-FE + GaussianFace-BC (98.52%).

7.4. Comparison with the State-of-the-art Methods

Motivated by the appealing performance of both GaussianFace-BC and GaussianFace-FE, we further combine them for face verification. Specifically, after facial features are extracted using GaussianFace-FE, GaussianFace-BC⁴ is used to make the final decision.
Figure 4 shows the results of this combination compared with state-of-the-art methods [9, 5, 14, 47, 13, 53, 59, 1, 20, 16]. The best previously published result on the LFW benchmark is 97.35%, achieved by [53]. Our GaussianFace model improves the accuracy to 98.52%, which for the first time beats human-level performance (97.53%, cropped) [28]. Figure 5 presents some example pairs that were always incorrectly classified by our model; clearly, even for humans, some of them are difficult to verify. Here, we emphasize that the centers of patches, rather than the accurate and dense facial landmarks used in [9], are utilized to extract multi-scale features in our method. This makes our method simpler and easier to use.

⁴Here, the GaussianFace-BC is trained with the high-dimensional features extracted by GaussianFace-FE.

Figure 5. The two rows present examples of matched and mismatched pairs, respectively, from LFW that were incorrectly classified by the GaussianFace model.

7.5. Further Validation: Shuffling the Source and Target

To further prove the validity of our model, we also treat Multi-PIE and MORPH, respectively, as the target-domain dataset, with the others as the source-domain datasets. The target-domain dataset is split into two mutually exclusive parts: one, consisting of 20,000 matched pairs and 20,000 mismatched pairs, is used for training; the other is used for testing. In the test set, similar to the protocol of LFW, we select 10 mutually exclusive subsets, where each subset consists of 300 matched pairs and 300 mismatched pairs. The experimental results are presented in Figure 6. Each time a dataset is added to the training set, the performance improves, even though the types of data in the training set are very different.

8. General Discussion
There is an implicit belief among many psychologists and computer scientists that human face verification abilities currently surpass those of computer-based face verification algorithms [39]. This belief, however, is supported more by anecdotal impression than by scientific evidence. By contrast, there have already been a number of papers comparing human and computer-based face verification performance [2, 54, 40, 41, 38, 8]. It has been shown that the best current face verification algorithms perform better than humans under good and moderate conditions, so it is really not that difficult to beat human performance in some specific scenarios. As pointed out by [38, 48], humans and computer-based algorithms use different strategies in face verification.

Figure 6. (a) The accuracy rate (%) of the GaussianFace model on Multi-PIE. (b) The accuracy rate (%) of the GaussianFace model on MORPH.

Indeed, in contrast to performance on unfamiliar faces, human face verification abilities for familiar faces are relatively robust to changes in viewing parameters such as illumination and pose. For example, Bruce [7] found that human recognition memory for unfamiliar faces dropped substantially when there were changes in viewing parameters. Besides, humans can take advantage of non-face configural information from the combination of the face and body (e.g., neck, shoulders). This has also been examined in [28], where human performance drops from 99.20% (tested using the original LFW images) to 97.53% (tested using the cropped LFW images).
Hence, the experiments comparing human and computer performance may not show human face verification skill at its best, because humans were asked to match the cropped faces of people previously unfamiliar to them. By contrast, those experiments can fully show the performance of computer-based face verification algorithms. First, the algorithms can exploit information from enough training images with variations in all viewing parameters to improve face verification performance, which is similar to the information humans acquire in developing face verification skills and in becoming familiar with individuals. Second, the algorithms might exploit useful, but subtle, image-based detailed information that gives them a slight, but consistent, advantage over humans. Therefore, surpassing human-level performance may only be symbolically significant. In reality, many challenges still lie ahead. To compete successfully with humans, more factors, such as robustness to familiar faces and the usage of non-face information, need to be considered in developing future face verification algorithms.

9. Conclusion and Future Work

This paper presents a principled multi-task learning approach based on Discriminative Gaussian Process Latent Variable Model, named GaussianFace, for face verification, by incorporating a computationally more efficient equivalent form of KFDA and the multi-task learning constraint into the DGPLVM model. We use Gaussian Processes approximation and anchor graphs to speed up the inference and prediction of our model. Based on the GaussianFace model, we propose two different approaches for face verification. Extensive experiments on challenging datasets validate the efficacy of our model.
The GaussianFace model finally surpassed human-level face verification accuracy, thanks to exploiting additional data from multiple source domains to improve the generalization performance of face verification in the target domain, and to adapting automatically to complex face variations.

Although several techniques such as the Laplace approximation and anchor graphs are introduced to speed up inference and prediction in our GaussianFace model, it still takes a long time to train the model for high performance, and a large amount of memory is also necessary. Therefore, for a specific application, one needs to balance three dimensions: memory, running time, and performance. Generally speaking, higher performance requires more memory and more running time. In the future, the issue of running time can be further addressed by distributed parallel algorithms or a GPU implementation of large matrix inversion. To address the issue of memory, online algorithms for training need to be developed. Another more intuitive method is to seek a more efficient sparse representation for the large covariance matrix.

Acknowledgements

We would like to thank Deli Zhao and Chen Change Loy for their insightful discussions. This work is partially supported by the "CUHK Computer Vision Cooperation" grant from Huawei, and by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Project No. CUHK 416510 and 416312) and the Guangdong Innovative Research Team Program (No. 201001D0104648280).

References

[1] VisionLabs. Website: http://www.visionlabs.ru/face-recognition.
[2] A. Adler and M. E. Schuckers. Comparing human and automatic face recognition performance. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 37(5):1248–1255, 2007.
[3] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. TPAMI, 28(12):2037–2041, 2006.
[4] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. Support vector clustering. JMLR, 2, 2002.
[5] T. Berg and P. N. Belhumeur. Tom-vs-Pete classifiers and identity-preserving alignment for face verification. In BMVC, volume 1, page 5, 2012.
[6] E. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In NIPS, 2008.
[7] V. Bruce. Changing faces: Visual and non-visual coding processes in face recognition. British Journal of Psychology, 73(1):105–116, 1982.
[8] V. Bruce, P. J. Hancock, and A. M. Burton. Comparisons between human and computer recognition of faces. In Automatic Face and Gesture Recognition, pages 408–413, 1998.
[9] X. Cao, D. Wipf, F. Wen, and G. Duan. A practical transfer learning algorithm for face verification. In ICCV, 2013.
[10] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In CVPR, pages 2707–2714, 2010.
[11] K. M. Chai. Multi-task learning with Gaussian processes. The University of Edinburgh, 2010.
[12] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM TIST, 2(3):27, 2011.
[13] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In ECCV, pages 566–579, 2012.
[14] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, 2013.
[15] S. Dasgupta and Y. Freund. Random projection trees for vector quantization. IEEE Transactions on Information Theory, 55(7):3229–3242, 2009.
[16] H. Fan, Z. Cao, Y. Jiang, Q. Yin, and C. Doudou. Learning deep face representation. arXiv preprint arXiv:1403.2802, 2014.
[17] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[18] Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.
[19] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[20] T. Heseltine, P. Szeptycki, J. Gomes, M. Ruiz, and P. Li. Aurora face recognition technical report: Evaluation of algorithm Aurora-c-2014-1 on Labeled Faces in the Wild.
[21] N. J. Higham. Accuracy and Stability of Numerical Algorithms. Number 48. SIAM, 1996.
[22] G. Huang, H. Lee, and E. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, 2012.
[23] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[24] S. U. Hussain, T. Napoléon, F. Jurie, et al. Face recognition using local quantized patterns. In BMVC, 2012.
[25] H.-C. Kim, D. Kim, Z. Ghahramani, and S. Y. Bang. Appearance-based gender classification with Gaussian processes. Pattern Recognition Letters, 27(6):618–626, 2006.
[26] H.-C. Kim and J. Lee. Clustering based on Gaussian processes. Neural Computation, 19(11), 2007.
[27] S.-J. Kim, A. Magnani, and S. Boyd. Optimal kernel selection in kernel Fisher discriminant analysis. In ICML, pages 465–472, 2006.
[28] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, pages 365–372, 2009.
[29] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In NIPS, volume 2, page 5, 2003.
[30] G. Leen, J. Peltonen, and S. Kaski. Focused multi-task learning using Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, pages 310–325, 2011.
[31] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang. Probabilistic elastic matching for pose variant face verification. In CVPR, 2013.
[32] Z. Li, D. Lin, and X. Tang. Nonparametric discriminant analysis for face recognition. TPAMI, 31(4):755–761, 2009.
[33] Z. Li, W. Liu, D. Lin, and X. Tang. Nonparametric subspace analysis for face recognition. In CVPR, volume 2, pages 961–966, 2005.
[34] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. TIP, 2002.
[35] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In ICML, pages 679–686, 2010.
[36] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[37] B. Moghaddam, T. Jebara, and A. Pentland. Bayesian face recognition. Pattern Recognition, 33(11):1771–1782, 2000.
[38] A. J. O'Toole, X. An, J. Dunlop, V. Natu, and P. J. Phillips. Comparing face recognition algorithms to humans on challenging tasks. ACM Transactions on Applied Perception, 9(4):16, 2012.
[39] A. J. O'Toole, F. Jiang, D. Roark, and H. Abdi. Predicting human performance for face recognition. Face Processing: Advanced Methods and Models. Elsevier, Amsterdam, 2006.
[40] A. J. O'Toole, P. J. Phillips, F. Jiang, J. Ayyad, N. Pénard, and H. Abdi. Face recognition algorithms surpass humans matching faces over changes in illumination. TPAMI, 29(9):1642–1646, 2007.
[41] P. J. Phillips and A. J. O'Toole. Comparison of human and computer performance across face recognition experiments. Image and Vision Computing, 32(1):74–85, 2014.
[42] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. 2006.
[43] K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In Automatic Face and Gesture Recognition, pages 341–345, 2006.
[44] O. Rudovic, I. Patras, and M. Pantic. Coupled Gaussian process regression for pose-invariant facial expression recognition. In ECCV, 2010.
[45] R. Salakhutdinov and G. E. Hinton. Using deep belief nets to learn covariance kernels for Gaussian processes. In NIPS, 2007.
[46] H. J. Seo and P. Milanfar. Face verification using the LARK representation. TIFS, 6(4):1275–1286, 2011.
[47] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In BMVC, 2013.
[48] P. Sinha, B. Balas, Y. Ostrovsky, and R. Russell. Face recognition by humans: 20 results all computer vision researchers should know about. Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, 2005.
[49] G. Skolidis and G. Sanguinetti. Bayesian multitask classification with Gaussian process priors. IEEE Transactions on Neural Networks, 22(12), 2011.
[50] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In ICCV, 2013.
[51] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
[52] Y. Taigman, L. Wolf, and T. Hassner. Multiple one-shots for utilizing class label information. In BMVC, pages 1–12, 2009.
[53] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[54] X. Tang and X. Wang. Face sketch recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):50–57, 2004.
[55] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.
[56] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In CVPR, pages 586–591, 1991.
[57] R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In ICML, pages 927–934, 2007.
[58] J. Wright and G. Hua. Implicit elastic matching with random projections for pose-variant face recognition. In CVPR, pages 1502–1509, 2009.
[59] Q. Yin, X. Tang, and J. Sun. An associate-predict model for face recognition. In CVPR, pages 497–504, 2011.
[60] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML, pages 1012–1019, 2005.
[61] Y. Zhang and D.-Y. Yeung. Multi-task warped Gaussian process for personalized age estimation. In CVPR, pages 2622–2629, 2010.
[62] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity preserving face space. In ICCV, 2013.
[63] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonical-view faces in the wild with deep neural networks, 2014.