
Journal of Machine Learning Research 8 (2007) 1197-1215 Submitted 8/05; Revised 2/07; Published 5/07

Synergistic Face Detection and Pose Estimation with Energy-Based Models

Margarita Osadchy RITA@CS.HAIFA.AC.IL
Department of Computer Science
University of Haifa
Mount Carmel, Haifa 31905, Israel

Yann LeCun YANN@CS.NYU.EDU
The Courant Institute
New York University
New York, NY 10003, USA

Matthew L. Miller MLM@NEC-LABS.COM
NEC Labs America
Princeton, NJ 08540, USA

Editor: Pietro Perona

**Abstract**

We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a low-dimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters.

The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones.

The system is designed to handle a very large range of poses without retraining. Its performance was tested on three standard data sets—for frontal views, rotated faces, and profiles—and is comparable to that of previous systems, each of which was designed to handle only one of these data sets.

We show that a system trained simultaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately.1

Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition

1. Introduction

The detection of human faces in natural images and videos is a key component in a wide variety of applications of human-computer interaction, search and indexing, security, and surveillance. Many real-world applications would profit from view-independent detectors that can detect faces under a wide range of poses: looking left or right (yaw axis), up or down (pitch axis), or tilting left or right (roll axis).

In this paper we describe a novel method that can not only detect faces independently of their poses, but also simultaneously estimate those poses. The system is highly reliable, runs in real time

1. A more preliminary version of this work appears as: Osadchy et al. (2005).

©2007 Margarita Osadchy, Yann LeCun and Matthew L. Miller.

on standard hardware, and is robust to variations in yaw (±90°), roll (±45°), and pitch (±60°), as well as to partial occlusions.

The method is motivated by the idea that multi-view face detection and pose estimation are so closely related that they should not be performed separately. The tasks are related in the sense that they could use similar features and internal representations, and must be robust against the same sorts of variation: skin color, glasses, facial hair, lighting, scale, expressions, etc. We suspect that, when trained together, each task can serve as an inductive bias for the other, yielding better generalization or requiring fewer training examples (Caruana, 1997).

To exploit the synergy between these two tasks, we train a learning machine to map input images to points in a low-dimensional space. In that output space we embed a face manifold parameterized by the facial pose parameters (e.g., pitch, yaw, and roll). A convolutional network is trained to map face images to the points on the face manifold that correspond to the poses of the faces, and non-face images to points far away from that manifold. After training, detection is performed by checking whether the distance of the output point from the manifold is below a threshold. If the point is close to the manifold, a face is present in the image, and its pose parameters can be inferred from the position of the projection of the point onto the manifold.

To map input images to points in the low-dimensional space, we employ a convolutional network architecture (LeCun et al., 1998). Convolutional networks are specifically designed to learn invariant representations of images. They can easily learn the type of shift-invariant local features that are relevant to face detection and pose estimation. More importantly, they can be replicated over large images (applied to every sub-window of a large image) at a small fraction of the cost of applying more traditional classifiers to every sub-window. This is a considerable advantage for building real-time systems.

As a learning machine we use the recently proposed Energy-Based Models (EBMs), which describe the inference process and the learning process in a single, well-principled framework (LeCun and Huang, 2005; LeCun et al., 2006).

Given an input (an image), an Energy-Based Model associates an energy with each configuration of the variables to be modeled (the face/non-face label and the pose parameters in our case). Making an inference with an EBM consists in searching for a configuration of the variables to be predicted that minimizes the energy, or in comparing the energies of a small number of configurations of those variables. EBMs have a number of advantages over probabilistic models: (1) there is no need to compute partition functions (normalization constants) that may be intractable; (2) because there is no requirement for normalization, the repertoire of possible model architectures is considerably richer. In our application we define an Energy-Based Model as a scalar-valued energy function of three variables (image, label, and pose), and we treat pose as a deterministic latent variable. Thus both the label and the pose of an image are inferred through the energy-minimization process.

Training an EBM consists in finding values of the trainable parameters (which parameterize the energy function) that associate low energies with "desired" configurations of variables, and high energies with "undesired" configurations. With probabilistic models, making the probability of some values large automatically makes the probabilities of other values small, because of the normalization. With EBMs, making the energy of desired configurations low does not necessarily make the energies of other configurations high. Therefore, one must be very careful when designing loss functions for EBMs. In our application to face detection we derive a new type of contrastive loss function that is tailored to such detection tasks.
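To illustrate the contrastive principle in a minimal sketch (the system's actual loss is derived later in the paper; the hinge form below is a generic stand-in, not the paper's loss), one can penalize the energy of the desired configuration directly while pushing undesired energies up only when they fall inside a margin:

```python
import numpy as np

def contrastive_hinge_loss(e_desired, e_undesired, margin=1.0):
    """Generic contrastive EBM loss (illustrative stand-in, not the paper's
    loss): lowering e_desired always reduces the loss, while e_undesired is
    pushed up only while it is smaller than `margin`."""
    return e_desired + np.maximum(0.0, margin - e_undesired)
```

For example, `contrastive_hinge_loss(0.2, 2.0)` reduces to 0.2 because the undesired configuration already has energy beyond the margin, while `contrastive_hinge_loss(0.2, 0.3)` adds a penalty of 0.7 that actively pushes the two energies apart.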


The paper is organized as follows. First, some of the relevant prior work on multi-view face detection is briefly discussed. Section 2 discusses the synergy between pose estimation and face detection, and describes the basic methods for integrating them. Section 3 discusses the learning machine, and Section 4 gives the results of experiments conducted with our system. Section 5 draws some conclusions.

**1.1 Previous Work**

Learning-based approaches to face detection abound, including real-time methods (Viola and Jones, 2001) and approaches based on convolutional networks (Vaillant et al., 1994; Garcia and Delakis, 2002). Most multi-view systems take a view-based approach, which involves building separate detectors for different views and either applying them in parallel (Pentland et al., 1994; Sung and Poggio, 1998; Schneiderman and Kanade, 2000; Li et al., 2002) or using a pose estimator to select the most appropriate detector (Jones and Viola, 2003; Huang et al., 2004). Another approach is to estimate and correct in-plane rotations before applying a single pose-specific detector (Rowley et al., 1998b). Some attempts have been made at integrating pose search and detection, but in a much smaller space of pose parameters (Fleuret and Geman, 2001).

Closer to our approach is that of Li et al. (2000), in which a number of Support Vector Regressors are trained to approximate smooth functions, each of which has a maximum for a face at a particular pose. Another machine is trained to convert the resulting values into estimates of poses, and a third machine is trained to convert the values into a face/non-face score. The resulting system is rather slow. See Yang et al. (2002) for a survey of face detection methods.

2. Integrating Face Detection and Pose Estimation

To exploit the posited synergy between face detection and pose estimation, we must design a system that integrates the solutions to the two problems. Merely cascading two systems, where the answer to one problem is used to assist in solving the other, will not optimally take advantage of the synergy.

Therefore, both answers must be derived from one underlying analysis of the input, and both tasks must be trained together.

Our approach is to build a trainable system that can map raw images X to points in a low-dimensional space (Figure 1). In that space, we pre-define a face manifold F(Z) that we parameterize by the pose Z. We train the system to map face images with known poses to the corresponding points on the manifold. We also train it to map images of non-faces to points far away from the manifold. During recognition, the system maps the input image X to a point G(X) in the low-dimensional space. The proximity of G(X) to the manifold then tells us whether or not the image X contains a face. By finding the pose parameters Z that correspond to the point on the manifold that is closest to G(X) (its projection), we obtain an estimate of the pose (Figure 2).
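This scheme can be illustrated with a toy one-parameter version (an assumed setup for illustration only, not the paper's actual manifold): take the unit circle F(z) = (cos z, sin z) as the face manifold in a 2-dimensional output space. Detection thresholds the distance of the output point to the circle, and the pose estimate is the angle of its projection:

```python
import numpy as np

def detect_and_estimate(g, threshold=0.3):
    """Toy one-parameter sketch (assumed setup, not the paper's manifold):
    the face manifold is the unit circle F(z) = (cos z, sin z) and g is the
    network output G(X) in the 2-dimensional output space."""
    distance = abs(np.linalg.norm(g) - 1.0)  # distance from g to the circle
    pose = np.arctan2(g[1], g[0])            # angle of the projection onto the circle
    return distance < threshold, pose, distance
```

A point slightly off the circle at angle 40° is accepted as a face with pose 40°, while a point far from the circle is rejected as a non-face.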

**2.1 Parameterizing the Face Manifold**

We will now describe the details of the parameterization of the face manifold. Three criteria directed the design of the face manifold: (1) preserving the topology and geometry of the problem; (2) providing enough space for mapping the background images far from the manifold (since the proximity to the manifold indicates whether the input image contains a face); and (3) minimizing


The point on the manifold closest to G(X) is just F(θ).

The choice of function is not limited to cosines. However, cosines are preferable since they allow the pose to be computed analytically from the output of the network. Without this property, finding the pose could require an expensive optimization process, or even the use of a second learning machine.

The same idea can be generalized to any number of pose parameters. Let us consider the set of all faces with yaw in [−90, 90] and roll in [−45, 45]. In an abstract way, this set is isomorphic to a portion of a sphere. Consequently, we can represent a point on the face manifold as a function of the two pose parameters by 9 basis functions that are the cross-products of three shifted cosines for one of the angles, and three shifted cosines for the other angle:

Fij(θ, φ) = cos(θ − αi) cos(φ − βj);  i, j = 1, 2, 3.

For convenience, we rescale the roll angles to the range [−90, 90], which allows us to set βi = αi.

With this parameterization, the manifold has constant curvature, which ensures that the effect of errors will be the same regardless of pose. Given a 9-dimensional output vector from the convolutional network Gij(X), we compute the corresponding yaw and roll angles θ, φ as follows:

cc = ∑ij Gij(X) cos(αi) cos(βj);  cs = ∑ij Gij(X) cos(αi) sin(βj);
sc = ∑ij Gij(X) sin(αi) cos(βj);  ss = ∑ij Gij(X) sin(αi) sin(βj);

θ = 0.5 (atan2(cs + sc, cc − ss) + atan2(sc − cs, cc + ss));
φ = 0.5 (atan2(cs + sc, cc − ss) − atan2(sc − cs, cc + ss)).
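As a sketch, the analytic recovery above can be written directly in code. The cosine shifts are not specified in this excerpt; here we assume αi = βj ∈ {−60°, 0°, 60°}, a choice whose doubled angles cancel on the unit circle, which makes the atan2 recovery exact:

```python
import numpy as np

# Assumed cosine shifts: alpha_i = beta_j in {-60, 0, 60} degrees (the
# excerpt does not fix them; this choice makes the recovery below exact).
ALPHAS = np.deg2rad([-60.0, 0.0, 60.0])

def manifold_point(theta, phi):
    # F_ij(theta, phi) = cos(theta - alpha_i) * cos(phi - beta_j), a 3x3 grid
    return np.cos(theta - ALPHAS)[:, None] * np.cos(phi - ALPHAS)[None, :]

def pose_from_output(G):
    # cc, cs, sc, ss are the four weighted sums defined in the text
    ca, sa = np.cos(ALPHAS), np.sin(ALPHAS)
    cc = ca @ G @ ca
    cs = ca @ G @ sa
    sc = sa @ G @ ca
    ss = sa @ G @ sa
    theta = 0.5 * (np.arctan2(cs + sc, cc - ss) + np.arctan2(sc - cs, cc + ss))
    phi = 0.5 * (np.arctan2(cs + sc, cc - ss) - np.arctan2(sc - cs, cc + ss))
    return theta, phi
```

Round-tripping a manifold point through `pose_from_output` returns the original yaw and (rescaled) roll, with no iterative optimization.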

The process can easily be extended to include pitch in addition to yaw and roll, as well as other parameters if necessary.


Figure 3: Left: face manifold embedding; right: manifold parametrization by a single pose parameter. The values of the three cosine functions for one pose angle constitute the three components of the point on the face manifold corresponding to that pose.

3. Learning Machine

To map input images to points in the low-dimensional space, we employ a convolutional network architecture trained within the energy-minimization framework. Next we present the details of the learning machine.

3.1 Energy Minimization Framework

We propose the following configuration of the Energy-Based Model (LeCun and Huang, 2005; LeCun et al., 2006). Consider a scalar-valued function EW(Y, Z, X), where X is a raw image, Z is a facial pose (e.g., yaw and roll as defined above), and Y is a binary label: Y = 1 for face, Y = 0 for non-face. W is a parameter vector subject to learning. EW(Y, Z, X) can be interpreted as an energy function that measures the degree of compatibility between the values of X, Z, and Y. The inference process consists in clamping X to the observed value (the image), and searching for the configuration of Z and Y that minimizes the energy EW(Y, Z, X):

(Y, Z) = argmin_{Y∈{Y}, Z∈{Z}} EW(Y, Z, X),

where {Y} = {0, 1} and {Z} = [−90, 90] × [−45, 45] for the yaw and roll variables.

Ideally, if the input X is the image of a face with pose Z*, then a properly trained system should give a lower energy to the face label Y = 1 than to the non-face label Y = 0 for any pose: EW(1, Z, X) < EW(0, Z, X), ∀Z. For accurate pose estimation, the system should give a lower energy to the correct pose than to any other pose: EW(1, Z*, X) < EW(1, Z, X), ∀Z ≠ Z*. Training a machine to satisfy these two conditions for any image will guarantee that the energy-minimizing inference process produces the correct answer.
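This inference can be sketched numerically on a coarse pose grid. The concrete energies below are assumptions for illustration only: we take the face energy EW(1, Z, X) to be the distance between the network output G(X) and the manifold point F(Z), the non-face energy EW(0, Z, X) to be a constant threshold T, and the cosine shifts to be (−60°, 0°, 60°); none of these values are fixed by this excerpt:

```python
import numpy as np

ALPHAS = np.deg2rad([-60.0, 0.0, 60.0])  # assumed cosine shifts

def F(theta, phi):
    # Manifold point F_ij(Z) = cos(theta - alpha_i) * cos(phi - beta_j)
    return np.cos(theta - ALPHAS)[:, None] * np.cos(phi - ALPHAS)[None, :]

def ebm_inference(G, T=0.5):
    """Energy-minimizing inference over (Y, Z) on a coarse pose grid.
    Assumed energies: E_W(1, Z, X) = ||G - F(Z)||, E_W(0, Z, X) = T."""
    thetas = np.deg2rad(np.arange(-90.0, 91.0, 1.0))  # yaw grid
    phis = np.deg2rad(np.arange(-45.0, 46.0, 1.0))    # roll grid
    energies = np.array([[np.linalg.norm(G - F(t, p)) for p in phis]
                         for t in thetas])
    i, j = np.unravel_index(np.argmin(energies), energies.shape)
    e_face = energies[i, j]
    y = 1 if e_face < T else 0  # face iff the best face energy beats T
    pose = (np.rad2deg(thetas[i]), np.rad2deg(phis[j]))
    return y, pose, e_face
```

An output that lies on the manifold is labeled a face with its generating pose, while an output far from the manifold has face energy above T everywhere and falls back to Y = 0.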

Transforming energies to probabilities can easily be done via the Gibbs distribution:

P(Y, Z|X) = exp(−βEW(Y, Z, X)) / ∑_{y∈{Y}} ∫_{z∈{Z}} exp(−βEW(y, z, X)) dz,

where β is an arbitrary positive constant, and {Y} and {Z} are the sets of possible values of y and z. With this formulation, we can easily interpret the energy minimization with respect to Y and Z as