Journal of Machine Learning Research 8 (2007) 1197-1215 Submitted 8/05; Revised 2/07; Published 5/07

Synergistic Face Detection and Pose Estimation

with Energy-Based Models

Margarita Osadchy RITA@CS.HAIFA.AC.IL

Department of Computer Science

University of Haifa

Mount Carmel, Haifa 31905, Israel

Yann Le Cun YANN@CS.NYU.EDU

The Courant Institute

New York University

New York, NY 10003, USA

Matthew L. Miller MLM@NEC-LABS.COM

NEC Labs America

Princeton NJ 08540, USA

Editor: Pietro Perona

Abstract

We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a low-dimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters.

The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones.

The system is designed to handle a very large range of poses without retraining. The performance of the system, tested on three standard data sets (for frontal views, rotated faces, and profiles), is comparable to that of previous systems that are designed to handle a single one of these data sets.

We show that a system trained simultaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately.¹

Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition

1. Introduction

The detection of human faces in natural images and videos is a key component in a wide variety of applications of human-computer interaction, search and indexing, security, and surveillance. Many real-world applications would profit from view-independent detectors that can detect faces under a wide range of poses: looking left or right (yaw axis), up or down (pitch axis), or tilting left or right (roll axis).

In this paper we describe a novel method that can not only detect faces independently of their poses, but also simultaneously estimate those poses. The system is highly reliable, runs in real time on standard hardware, and is robust to variations in yaw (±90°), roll (±45°), pitch (±60°), as well as partial occlusions.

1. A more preliminary version of this work appears as: Osadchy et al. (2005).

© 2007 Margarita Osadchy, Yann Le Cun and Matthew L.

The method is motivated by the idea that multi-view face detection and pose estimation are so closely related that they should not be performed separately. The tasks are related in the sense that they could use similar features and internal representations, and must be robust against the same sorts of variation: skin color, glasses, facial hair, lighting, scale, expressions, etc. We suspect that, when trained together, each task can serve as an inductive bias for the other, yielding better generalization or requiring fewer training examples (Caruana, 1997).

To exploit the synergy between these two tasks, we train a learning machine to map input images to points in a low-dimensional space. In the low-dimensional output space we embed a face manifold which is parameterized by facial pose parameters (e.g., pitch, yaw, and roll). A convolutional network is trained to map face images to points on the face manifold that correspond to the pose of the faces and non-face images to points far away from that manifold. After training, a detection is performed by measuring whether the distance of the output point from the manifold is lower than a threshold. If the point is close to the manifold, indicating that a face is present in the image, its pose parameters can be inferred from the position of the projection of the point onto the manifold.

To map input images to points in the low-dimensional space, we employ a convolutional network architecture (LeCun et al., 1998). Convolutional networks are specifically designed to learn invariant representations of images. They can easily learn the type of shift-invariant local features that are relevant to face detection and pose estimation. More importantly, they can be replicated over large images (applied to every sub-window in a large image) at a small fraction of the cost of applying more traditional classifiers to every sub-window in an image. This is a considerable advantage for building real-time systems.
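The cost argument can be illustrated with a deliberately simplified sketch: a single linear filter rather than a full convolutional network. The toy image, the filter values, and the helper names `correlate_valid` and `crop` are assumptions made for illustration only. The sketch checks that filtering every sub-window separately yields exactly the crops of one response map computed over the whole image, so the per-window work is redundant.

```python
def correlate_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists); return the 'valid' response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def crop(image, top, left, height, width):
    """Return the height x width block whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Toy 6x6 "image" and 3x3 filter; the values are arbitrary.
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
window = 4  # input size of the hypothetical per-window classifier

# One pass over the whole image.
full_map = correlate_valid(image, kernel)

# A separate pass for every 4x4 sub-window.
per_window = {
    (top, left): correlate_valid(crop(image, top, left, window, window), kernel)
    for top in range(len(image) - window + 1)
    for left in range(len(image[0]) - window + 1)
}

# Each per-window map is a crop of the full-image map.
out = window - len(kernel) + 1
matches = all(
    per_window[pos] == crop(full_map, pos[0], pos[1], out, out)
    for pos in per_window
)

# Number of filter evaluations in each strategy.
full_evals = len(full_map) * len(full_map[0])
window_evals = len(per_window) * out * out
```

In this toy setting the per-window strategy evaluates the filter more often than the single pass while producing the same numbers. A real network adds further layers and subsampling, which this sketch does not model.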

As a learning machine we use the recently proposed Energy-Based Models (EBM), which provide a description of the inference process and the learning process in a single, well-principled framework (LeCun and Huang, 2005; LeCun et al., 2006).

Given an input (an image), an Energy-Based Model associates an energy to each configuration of the variables to be modeled (the face/non-face label and the pose parameters in our case). Making an inference with an EBM consists in searching for a configuration of the variables to be predicted that minimizes the energy, or comparing the energies of a small number of configurations of those variables. EBMs have a number of advantages over probabilistic models: (1) there is no need to compute partition functions (normalization constants) that may be intractable; (2) because there is no requirement for normalization, the repertoire of possible model architectures that can be used is considerably richer. In our application we define an Energy-Based Model as a scalar-valued energy function of three variables: image, label, and pose, and we treat pose as a deterministic latent variable. Thus both the label of an image and its pose are inferred through the energy-minimization process.

Training an EBM consists in finding values of the trainable parameters (which parameterize the energy function) that associate low energies to “desired” configurations of variables, and high energies to “undesired” configurations. With probabilistic models, making the probability of some values large automatically makes the probabilities of other values small because of the normalization. With EBMs, making the energy of desired configurations low does not necessarily make the energies of other configurations high. Therefore, one must be very careful when designing loss functions for EBMs. In our application to face detection we derive a new type of contrastive loss function that is tailored to such detection tasks.


The paper is organized as follows. First, some of the relevant prior works on multi-view face detection are briefly discussed. Section 2 discusses the synergy between pose estimation and face detection, and describes the basic methods for integrating them. Section 3 discusses the learning machine, and Section 4 gives the results of experiments conducted with our system. Section 5 draws some conclusions.

1.1 Previous Work

Learning-based approaches to face detection abound, including real-time methods (Viola and Jones, 2001), and approaches based on convolutional networks (Vaillant et al., 1994; Garcia and Delakis, 2002). Most multi-view systems take a view-based approach, which involves building separate detectors for different views and either applying them in parallel (Pentland et al., 1994; Sung and Poggio, 1998; Schneiderman and Kanade, 2000; Li et al., 2002) or using a pose estimator to select the most appropriate detector (Jones and Viola, 2003; Huang et al., 2004). Another approach is to estimate and correct in-plane rotations before applying a single pose-specific detector (Rowley et al., 1998b). Some attempts have been made to integrate pose search and detection, but in a much smaller space of pose parameters (Fleuret and Geman, 2001).

Closer to our approach is that of Li et al. (2000), in which a number of Support Vector Regressors are trained to approximate smooth functions, each of which has a maximum for a face at a particular pose. Another machine is trained to convert the resulting values to estimates of poses, and a third machine is trained to convert the values into a face/non-face score. The resulting system is rather slow. See Yang et al. (2002) for a survey of face detection methods.

2. Integrating Face Detection and Pose Estimation

To exploit the posited synergy between face detection and pose estimation, we must design a system that integrates the solutions to the two problems. Merely cascading two systems where the answer to one problem is used to assist in solving the other will not optimally take advantage of the synergy. Therefore, both answers must be derived from one underlying analysis of the input, and both tasks must be trained together.

Our approach is to build a trainable system that can map raw images X to points in a low-dimensional space (Figure 1). In that space, we pre-define a face manifold F(Z) that we parameterize by the pose Z. We train the system to map face images with known poses to the corresponding points on the manifold. We also train it to map images of non-faces to points far away from the manifold. During recognition, the system maps the input image X to a point G(X) in the low-dimensional space. The proximity of G(X) to the manifold then tells us whether or not an image X is a face. By finding the pose parameters Z that correspond to the point on the manifold that is closest to the point G(X) (projection), we obtain an estimate of the pose (Figure 2).
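The recognition step described above can be sketched as a nearest-point search followed by a threshold test. Everything concrete in this sketch is an assumption made for illustration: the one-parameter manifold built from three cosines shifted by −60°, 0°, and 60°, the grid search over the pose (a simple numerical stand-in for the projection), the threshold value, and the names `manifold_point` and `detect`.

```python
import math

# Assumed shifts for a one-parameter (yaw only) toy manifold.
ALPHAS = [math.radians(a) for a in (-60.0, 0.0, 60.0)]


def manifold_point(theta):
    """F(theta): point on the toy face manifold for pose `theta` (radians)."""
    return [math.cos(theta - a) for a in ALPHAS]


def detect(g, threshold, n_grid=181):
    """Return (is_face, pose_in_degrees, distance) for a network output `g`.

    The closest manifold point is found by scanning poses in [-90, 90] degrees.
    """
    best_dist, best_theta = float("inf"), 0.0
    for k in range(n_grid):
        theta = math.radians(-90.0 + 180.0 * k / (n_grid - 1))
        dist = math.dist(g, manifold_point(theta))
        if dist < best_dist:
            best_dist, best_theta = dist, theta
    return best_dist < threshold, math.degrees(best_theta), best_dist


# An output lying on the manifold at 30 degrees, and one far from it.
face_like = manifold_point(math.radians(30.0))
background_like = [2.0, -2.0, 2.0]

face_result = detect(face_like, threshold=0.5)
background_result = detect(background_like, threshold=0.5)
```

The first element of each result is the face/non-face decision, and the second is the pose estimate, which is only meaningful when the decision is positive.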

2.1 Parameterizing the Face Manifold

We will now describe the details of the parameterization of the face manifold. Three criteria directed the design of the face manifold: (1) preserving the topology and geometry of the problem; (2) providing enough space for mapping the background images far from the manifold (since the proximity to the manifold indicates whether the input image contains a face); and (3) minimizing


–  –  –

The point on the manifold closest to G(X) is just F(θ).

The choice of function is not limited to cosines. However, cosines are preferable since they allow the pose to be computed analytically from the output of the network. Without this property, finding the pose could be an expensive optimization process, or even require the use of a second learning machine.

The same idea can be generalized to any number of pose parameters. Let us consider the set of all faces with yaw in [−90, 90] and roll in [−45, 45]. In an abstract way, this set is isomorphic to a portion of a sphere. Consequently, we can represent a point on the face manifold as a function of the two pose parameters by 9 basis functions that are the cross-products of three shifted cosines for one of the angles, and three shifted cosines for the other angle:

$$F_{ij}(\theta, \phi) = \cos(\theta - \alpha_i)\cos(\phi - \beta_j); \quad i, j = 1, 2, 3.$$

For convenience, we rescale the roll angles to the range [−90, 90], which allows us to set $\beta_i = \alpha_i$.

With this parameterization, the manifold has constant curvature, which ensures that the effect of errors will be the same regardless of pose. Given a 9-dimensional output vector from the convolutional network $G_{ij}(X)$, we compute the corresponding yaw and roll angles $\theta, \phi$ as follows:

$$cc = \sum_{ij} G_{ij}(X)\cos(\alpha_i)\cos(\beta_j); \quad cs = \sum_{ij} G_{ij}(X)\cos(\alpha_i)\sin(\beta_j);$$

$$sc = \sum_{ij} G_{ij}(X)\sin(\alpha_i)\cos(\beta_j); \quad ss = \sum_{ij} G_{ij}(X)\sin(\alpha_i)\sin(\beta_j);$$

$$\theta = 0.5\,\bigl(\operatorname{atan2}(cs + sc,\; cc - ss) + \operatorname{atan2}(sc - cs,\; cc + ss)\bigr);$$

$$\phi = 0.5\,\bigl(\operatorname{atan2}(cs + sc,\; cc - ss) - \operatorname{atan2}(sc - cs,\; cc + ss)\bigr).$$

The process can easily be extended to include pitch in addition to yaw and roll, as well as other parameters if necessary.
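A minimal round-trip sketch of the yaw and roll formulas follows. Two choices are assumptions, since the surviving text does not state them: the shifts are taken to be −60°, 0°, and 60°, and the roll rescaling from [−45, 45] to [−90, 90] is taken to be a multiplication by 2. The function names `encode` and `decode` are hypothetical.

```python
import math

# Assumed shifts alpha_i; beta_j = alpha_j after rescaling roll to [-90, 90].
ALPHAS = [math.radians(a) for a in (-60.0, 0.0, 60.0)]
BETAS = ALPHAS


def encode(yaw_deg, roll_deg):
    """F_ij(theta, phi): 3x3 manifold point for a (yaw, roll) pose in degrees."""
    theta = math.radians(yaw_deg)
    phi = math.radians(2.0 * roll_deg)  # assumed linear rescaling of roll
    return [[math.cos(theta - a) * math.cos(phi - b) for b in BETAS]
            for a in ALPHAS]


def decode(g):
    """Recover (yaw, roll) in degrees from a 3x3 output g via the atan2 formulas."""
    cc = cs = sc = ss = 0.0
    for i, a in enumerate(ALPHAS):
        for j, b in enumerate(BETAS):
            cc += g[i][j] * math.cos(a) * math.cos(b)
            cs += g[i][j] * math.cos(a) * math.sin(b)
            sc += g[i][j] * math.sin(a) * math.cos(b)
            ss += g[i][j] * math.sin(a) * math.sin(b)
    angle_sum = math.atan2(cs + sc, cc - ss)   # theta + phi
    angle_diff = math.atan2(sc - cs, cc + ss)  # theta - phi
    theta = 0.5 * (angle_sum + angle_diff)
    phi = 0.5 * (angle_sum - angle_diff)
    return math.degrees(theta), math.degrees(phi) / 2.0  # undo roll rescaling


yaw, roll = decode(encode(40.0, -20.0))
```

With these shifts, the sums of cos² and sin² over the three angles are equal and the cross terms cancel, which is why the four sums reduce to scaled products of sines and cosines of θ and φ. At the extreme corner of the pose range (θ + φ = ±180°) the atan2 branch cut makes the recovery ambiguous, so the sketch should not be relied on there.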


–  –  –

Figure 3: Left: face manifold embedding; right: manifold parametrization by a single pose parameter. The values of the cosine functions for one pose angle constitute the three components of a point on the face manifold corresponding to that pose.

3. Learning Machine

To map input images to points in the low-dimensional space, we employ a convolutional network architecture trained using the Energy Minimization Framework. Next we present the details of the learning machine.

3.1 Energy Minimization Framework

We propose the following configuration of the Energy Based Model (LeCun and Huang, 2005; LeCun et al., 2006). Consider a scalar-valued function $E_W(Y, Z, X)$, where X is a raw image, Z is a facial pose (e.g., yaw and roll as defined above), and Y is a binary label: Y = 1 for face, Y = 0 for non-face. W is a parameter vector subject to learning. $E_W(Y, Z, X)$ can be interpreted as an energy function that measures the degree of compatibility between the values of X, Z, and Y. The inference process consists in clamping X to the observed value (the image), and searching for configurations of Z and Y that minimize the energy $E_W(Y, Z, X)$:

$$(\overline{Y}, \overline{Z}) = \operatorname{argmin}_{Y \in \{Y\},\, Z \in \{Z\}} E_W(Y, Z, X),$$

where {Y} = {0, 1} and {Z} = [−90, 90] × [−45, 45] for yaw and roll variables.

Ideally, if the input X is the image of a face with pose Z, then a properly trained system should give a lower energy to the face label Y = 1 than to the non-face label Y = 0 for any pose: $E_W(1, Z, X) < E_W(0, Z', X), \ \forall Z'$. For accurate pose estimation, the system should give a lower energy to the correct pose than to any other pose: $E_W(1, Z, X) < E_W(1, Z', X), \ \forall Z' \neq Z$. Training a machine to satisfy those two conditions for any image will guarantee that the energy-minimizing inference process will produce the correct answer.
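The inference rule can be sketched as a brute-force search over the binary label and a discretized pose grid. The energy used here, `toy_energy`, is a hypothetical stand-in and is not claimed to be the energy function of this paper: it treats the input as a 2-vector already expressed in (yaw, roll) degrees, uses the Euclidean distance to the candidate pose for Y = 1, and a constant for Y = 0. The grid step and the helper names are likewise assumptions.

```python
import math
from itertools import product


def grid(lo, hi, step):
    """Evenly spaced values from lo to hi, inclusive."""
    n = int(round((hi - lo) / step))
    return [lo + step * k for k in range(n + 1)]


def infer(energy, x, step=5.0):
    """Argmin of energy(y, z, x) over y in {0, 1} and gridded poses z."""
    poses = list(product(grid(-90.0, 90.0, step), grid(-45.0, 45.0, step)))
    y, z = min(product((0, 1), poses),
               key=lambda yz: energy(yz[0], yz[1], x))
    return y, (z if y == 1 else None)  # the pose is only meaningful for faces


def toy_energy(y, z, x, non_face_energy=20.0):
    """Hypothetical stand-in for E_W(Y, Z, X), used only to exercise `infer`."""
    if y == 1:
        return math.dist(x, z)  # low when the candidate pose matches x
    return non_face_energy      # constant energy for the non-face label


face_answer = infer(toy_energy, (32.0, -11.0))
background_answer = infer(toy_energy, (300.0, 300.0))
```

For the first input some face configuration has lower energy than the non-face constant, so the search returns Y = 1 together with the closest grid pose. For the second input every face configuration has higher energy, so the search returns Y = 0.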

Transforming energies to probabilities can easily be done via the Gibbs distribution:

$$P(Y, Z \mid X) = \frac{\exp(-\beta E_W(Y, Z, X))}{\int_{y \in \{Y\},\, z \in \{Z\}} \exp(-\beta E_W(y, z, X))},$$

where β is an arbitrary positive constant, and {Y} and {Z} are the sets of possible values of y and z. With this formulation, we can easily interpret the energy minimization with respect to Y and Z as


–  –  –
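As a rough numerical illustration of the Gibbs distribution defined above, the sketch below replaces the integral over configurations with a sum over a small finite set, which is an assumption made to keep the example computable. The energies in the table are invented values, and the function name `gibbs` is hypothetical.

```python
import math


def gibbs(energies, beta=1.0):
    """Map {configuration: energy} to {configuration: probability}.

    The normalizing integral is approximated by a sum over the listed
    configurations. Subtracting the minimum energy before exponentiating
    leaves the ratios unchanged and avoids overflow.
    """
    e_min = min(energies.values())
    weights = {c: math.exp(-beta * (e - e_min)) for c, e in energies.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Invented energies for three (label, yaw) configurations.
energies = {(1, 30.0): 1.0, (1, 35.0): 2.0, (0, None): 5.0}

probs = gibbs(energies, beta=1.0)
sharper = gibbs(energies, beta=5.0)
most_probable = max(probs, key=probs.get)
lowest_energy = min(energies, key=energies.get)
```

Because the exponential is monotone, the most probable configuration coincides with the lowest-energy one for any positive β, and a larger β concentrates more probability mass on it.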
