Real-time 3D Eyelids Tracking from Semantic Edges


QUAN WEN, FENG XU, MING LU, and JUN-HAI YONG, Tsinghua University

Fig. 1. We present an approach to reconstruct and track 3D eyelids in real time. This technique is integrated into a face and eyeball tracking system to obtain

full face results with more detailed eye regions, again in real time (blue rectangles show closeups of the eye region). Our technique successfully reconstructs

both eyelid shapes and poses, for instance the shapes of eye contours, double-folds and bulges in the center-left result and the pose differences between the

two eyes in the right-most result.

State-of-the-art real-time face tracking systems still lack the ability to realistically portray subtle details of various aspects of the face, particularly

the region surrounding the eyes. To improve this situation, we propose a

technique to reconstruct the 3D shape and motion of eyelids in real time. By

combining these results with the full facial expression and gaze direction,

our system generates complete face tracking sequences with more detailed

eye regions than existing solutions in real time. To achieve this goal, we propose a generative eyelid model which decomposes eyelid variation into two

low-dimensional linear spaces which efficiently represent the shape and motion of eyelids. Then, we modify a holistically-nested DNN model to jointly

perform semantic eyelid edge detection and identification on images. Next,

we correspond vertices of the eyelid model to 2D image edges, and employ

polynomial curve fitting and a search scheme to handle incorrect and partial

edge detections. Finally, we use the correspondences in a 3D-to-2D edge

fitting scheme to reconstruct eyelid shape and pose. By integrating our fast

fitting method into a face tracking system, the estimated eyelid results are

seamlessly fused with the face and eyeball results in real time. Experiments

show that our technique applies to different human races, eyelid shapes, and

eyelid motions, and is robust to changes in head pose, expression and gaze

direction.

CCS Concepts: • Computing methodologies → Motion capture;

Additional Key Words and Phrases: facial performance capture, eyelid modeling, eyelid tracking, semantic edge detection

This work was supported by the NSFC (No. 61671268, 61672307, 61727808) and the

National Key Technologies R&D Program of China (No. 2015BAF23B03). Quan Wen,

Feng Xu (corresponding author), Jun-Hai Yong are from TNList and School of Software.

Ming Lu is from TNList and Department of Electronic Engineering.

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for profit or commercial advantage and that copies bear this notice and the full citation

on the first page. Copyrights for components of this work owned by others than ACM

must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,

to post on servers or to redistribute to lists, requires prior specific permission and/or a

fee. Request permissions from permissions@acm.org.

© 2017 Association for Computing Machinery.

0730-0301/2017/11-ART193 $15.00



ACM Reference Format:

Quan Wen, Feng Xu, Ming Lu, and Jun-Hai Yong. 2017. Real-time 3D Eyelids Tracking from Semantic Edges. ACM Trans. Graph. 36, 6, Article 193

(November 2017), 11 pages.



1 INTRODUCTION

The human face is often the most important body part for a computer

to track as it conveys identity and emotion. Thus, facial capture and

animation is an important research topic in computer graphics, with

applications across movies, computer games, and online communications. Existing techniques reconstruct the face in real time using

consumer-level RGB or RGBD sensors, which makes obtaining facial

expression cheap and fast [Bouaziz et al. 2013; Cao et al. 2014a, 2013;

Li et al. 2013; Weise et al. 2011]. However, state-of-the-art methods

mainly focus on face skin regions and are still unable to realistically

convey many subtle expressions, such as the shape and motion of

the eyes, the window to the soul.

Many recent efforts have improved tracking for specific facial

organs, including eyes [Bérard et al. 2016, 2014], eyelids [Bermano

et al. 2015], lips [Edwards et al. 2016; Garrido et al. 2016], and teeth

[Wu et al. 2016]. These techniques produce high-quality modeling,

but are too complex to be applied in real-time. Recently, real-time

eyeball modeling and tracking has been achieved [Thies et al. 2016b;

Wang et al. 2016; Wen et al. 2016] in face tracking and animation

systems. However, this alone is often insufficient for realistic capture.

Modeling eyelid shape and motion, including folds and bulges, is

still required to generate realistic eye regions in real time.

It is difficult to model the shape and motion of local eye regions

because they are small and their motions involve heavy occlusions

in fold regions. This is unlike capturing the full facial expression

where surfaces are larger and less often occluded. Recent works have

used principal component analysis (PCA) to model the overall shape


Fig. 2. Overview of our system. Note that the depth input is required by the face fitting technique, not our eyelid reconstruction. We show it here as it

contributes to the full face results shown in the figure.

of the eye region from 22 scans [Wood et al. 2016a,b]. However, eye

reconstruction should include subtle details like folds, eye contours,

and bulges, and real-time performance requires efficient estimation

of their motions. Shape from shading can be used to reconstruct

these details [Garrido et al. 2013; Richardson et al. 2016; Shi et al.

2014], but these systems are offline. In online systems, only mid-scale facial features, like strong wrinkles, can be reconstructed [Cao

et al. 2015].

In this paper, we propose a system to reconstruct more detailed

eye regions in real time. First, we propose a high-fidelity generative

eyelid model with a set of dimensions which independently represent the shapes and motions of eye contours, folds, and bulges. The

model is a linear parametric model which efficiently reconstructs

eye regions via linear combinations. Second, we propose projective

fitting of semantic eye region edges to reconstruct eye regions in

real time. The semantic eye region edges are extracted by a multi-channel holistically-nested edge detector, which jointly achieves

edge detection and identification. Third, we propose a polynomial

curve fitting technique and a correspondence updating technique to

handle incorrect and missing edge results, and to achieve real-time

performance of the projective fitting.

We list our specific contributions:

• Real-time reconstruction, tracking, and animation of realistic human eyelid shape and motion. It is combined with face and eyeball tracking to generate more complete and vivid results.
• A linear parametric eyelid model for human eyelid shape and motion. It represents details and fits real-time applications. We will release the model to the community.
• A real-time semantic edge-based eyelid fitting solution. The edge detector uses semantic information in a holistically-nested DNN model. Detailed eyelids are reconstructed in real time by a novel 3D-to-2D projective edge fitting algorithm.

1.1 Related Work

We propose a more detailed eyelid model to represent eyelid shape

and motion, and use this model to achieve real-time eyelid reconstruction. As there is no previous work focusing on exactly the same

goal, we briefly survey models and reconstruction techniques for

general faces.


Face Models. To better model, track, and animate human faces,

generative 3D face models have been proposed which involve prior

knowledge of face geometries and motions. Through these models,

a 3D face can be represented and reconstructed in a low dimensional

space. Blanz and Vetter [1999] propose the first 3D morphable face

model from 200 face scans of different individuals with fixed neutral

expression. The morphable face model is further extended with

1000 face scans [Booth et al. 2016] and with both shape and texture [Paysan et al. 2009; Zhu et al. 2015]. Besides facial identity,

facial expression is also represented in a low dimensional space.

The blendshape model, which consists of a set of key expressions of

a particular facial identity, can be used to represent novel expressions by linear combinations of blendshapes. Blendshapes can also

be generalized to novel identities by deformation transfer [Sumner and Popović 2004] or example-based rigging [Li et al. 2010].

The functions of morphable model and blendshape model can also

be achieved in one model, called a multilinear model [Cao et al.

2014b; Vlasic et al. 2005], in which facial identity and expression

are independently modeled by two sets of parameters.

These models focus on the overall face, but are not designed for specific face parts. Modeling such parts is crucial to attaining higher fidelity in face

modeling. In [Edwards et al. 2016], a 3D viseme model, called Jali,

is proposed to model the speech-related mouth motion. Olszewski

et al. [2016] use a new blendshape model with 29 shapes for mouth

poses to better model the mouth regions (5 of them for modeling

tongue motions). Recently, teeth have been modeled delicately with

a tooth row model and a local shape model for individual teeth [Wu

et al. 2016]. For eyeballs, a morphable model has been proposed

to achieve lightweight eye capture [Bérard et al. 2016]. For modeling eyelid and eye regions, early works focused on the domain

of 2D images, which are well surveyed [Ruhland et al. 2014]. In

recent years, 3D eye region models have been proposed [Wood et al.

2016a,b], where a PCA model with 8 dimensions is derived to model

the overall shape of eye regions in a fully opened eye pose. However,

there is no 3D generative model that can represent eye shapes with

folds and bulges, let alone model the eyelid dynamics at the same

time.

Face Reconstruction. We can generate high quality face shapes

and dynamics with stereo [Beeler et al. 2011] and shape from shading (SfS) techniques [Garrido et al. 2013; Shi et al. 2014]. For eye

regions, delicate eyeball models are reconstructed by a complex


Fig. 3. Folds and bulges with different poses. The bottom image is zoomed

in from the left-side eye region. (a,b) show that the fold disappears with the eye-closing motion. (c,d) show that the bulge disappears with the

upwards-looking motion.

capture and processing technique [Bérard et al. 2014]. An eyelid

morphable model is also proposed and used for image-based eye

region animation [Wood et al. 2017]. Neog et al. [2016] use a cubic

spline curve to build an eyelid shape model to learn the correlation

between eyeball direction and eyelid motion. Besides these large-scale eyelid modeling and animation techniques, detailed dynamics

of eyelid folds are also modeled and reconstructed with multi-view

input and offline processing [Bermano et al. 2015]. None of these

techniques are real time, which limits their application.

Real-time 3D face tracking and animation was first demonstrated

by Weise et al. [2011] for a specific user with an RGBD sensor. This

work was extended by Li et al. [2013] and Bouaziz et al. [2013] to

remove the specific user requirement, and by Cao et al. [2013] to

require only RGB input. Later, Cao et al. [2014a] combined these

two extensions and reconstructed medium-scale face features like

strong wrinkles [Cao et al. 2015]. Occlusion handling can improve

the robustness of facial tracking systems [Hsieh et al. 2015; Liu

et al. 2015; Saito et al. 2016]. To improve realism, attention has

now turned to face parts. Wang et al. [2016] and Wen et al. [2016]

track eyeball rotations from RGB and RGBD input in real time.

In addition, following the works of real-time face reenactment on

RGBD and RGB inputs [Thies et al. 2015, 2016a], Thies et al. [2016b]

achieve gaze retargeting for face reenactment with virtual reality

(VR) headsets. This is similar to other research on VR-based face

tracking [Li et al. 2015; Olszewski et al. 2016]. All the aforementioned

real-time systems handle neither precise eyelid shapes nor motions.

1.2 Overview

In our system (fig. 2), we first propose two linear models for representing

the eyelid shape and pose in low dimensional subspaces (section 2).

Meanwhile, we train an improved holistically-nested network to

jointly achieve detection and identification of four semantic edges

in the eye regions (section 3). In the online tracking stage, with

the input color and depth sequence, we integrate our eyelid fitting

technique into a face and eyeball tracking system introduced by

Wen et al. [2016]. As the eyelid models have been pre-aligned to the

face models, we directly perform eyelid shape and pose tracking

(section 4) in the face tracking system. By iteratively solving the

optimizations, we recover the eyelid model parameters that best fit

Fig. 4. Example bases in shape and pose rigs. (a) the basic eyelid mesh: b_0^{id} and b_0^{exp}; (b-d) three shape bases with eye contour change, fold change and bulge change on the right-side eye, b_{11}^{id}, b_{21}^{id} and b_{23}^{id}; (e-g) three pose bases with downward motions on the inner part, the outer part and the whole of the upper eyelid of the right-side eye, b_3^{exp}, b_5^{exp} and b_1^{exp}.

the four edges detected on the image. Finally, as our eyelid models

are generative, we can transfer the reconstructed eyelid motions to

a novel eyelid identity in real time.

2 LINEAR EYELID MODELS

For 3D face modeling, previous works use morphable models to represent the identity/shape changes of a face, and blendshape models

to represent expression/pose changes. Similarly, we also use two

linear models, with two sets of 3D mesh bases, to represent the

shape and pose variations of eyelids independently. In this manner,

users can control the shape and pose more freely. However, for faces,

the shape and pose correspond to totally different changes on face

geometry, while for eyelids, they may cause the same changes. For

example, different identities may or may not have folds and bulges,

and different poses (e.g., opened/closed, looking forwards/upwards)

may also cause the folds and bulges to appear and disappear, as

shown in fig. 3. Our two linear rigs share identical bases to handle

this issue.

2.1 Shape Linear Rig

Eyelids vary among genders, races, and ages. The characteristics of

an eyelid can be categorized as follows:

• Position stands for the relative positions of the eyes on a face.

It is represented by the vertical location of the eye pair and

the horizontal distance between the two eyes.

• Contour shape of the eye is affected by whether the eyes are

round or flat, wide or narrow in the horizontal direction,

upturned (the outer eye corner is higher than the inner eye

corner) or downturned (the outer eye corner is lower than

the inner eye corner).

• Double-fold is described by the distances between the fold and

the upper eyelid on different parts, and the strength of the

fold. Note that a single-fold eyelid, commonly seen in Asian

populations, is an extreme case of the double-fold eyelid. It

can be represented as a zero-strength double-fold eyelid.

• Bulge indicates the shape and strength of a bulge beneath an eye. 'No bulge' is represented by a zero-strength bulge.

Based on the observations above, we asked an artist to build a set

of eyelids to cover the characteristic variations. We asked the artist


Fig. 5. Detection and identification of the four semantic edges. Each color

corresponds to one kind of semantic edge (blue: fold edge; green: top edge;

red: bottom edge; purple: bulge edge). (a,b) examples with all four edges; (c)

an example without fold edge; (d) an example without bulge edge.

to first build a basic 3D eyelid mesh with a neutral identity in a neutral, open pose, and then modify the basic mesh to generate a set of

new meshes. The artist edited these meshes to cover the eyelid space

as much as possible, based on her experience and photographic reference: . The difference between each new

mesh and the base mesh represents one dimension of one characteristic, with the topology and the semantic meaning of the vertices

kept between all meshes. We use this set of eyelids to construct the

bases of our linear shape model:

B^{id} = \{ b_k^{id} \mid k = 0, ..., N^{id} - 1 \}, \quad N^{id} = 29,   (1)

where b_0^{id} is a pair of neutral eyelids. Details of the bases in the

shape rig, along with the following pose rig, are described in the

appendix, and we show a few eyelid bases in fig. 4. Note that for

position, the two eyes always move together (higher or lower, closer

or farther), but for other characteristics, the two eyes could change

independently as they may be slightly different for most people.

Thus, our bases either relate to two eyes or to one eye.

With the shape bases and a set of blending weights, the eyelid

model of a specific user in a neutral, open pose is synthesized by:

E_N = b_0^{id} + \sum_{k=1}^{N^{id}-1} w_k^{id} (b_k^{id} - b_0^{id}),   (2)

where w^{id} = (w_0^{id}, ..., w_{N^{id}-1}^{id})^T is the shape weight.
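For illustration, eq. (2) is a per-vertex linear blend of basis offsets. The following sketch assumes the bases are stored as arrays of vertex positions sharing one topology; the array layout and the name blend_shape_rig are ours, not part of the released model.

```python
import numpy as np

def blend_shape_rig(bases_id, w_id):
    """Synthesize the neutral-pose eyelid E_N of eq. (2) by linear blending.

    bases_id : (N_id, V, 3) vertex positions; bases_id[0] is the neutral b_0^id.
    w_id     : (N_id - 1,) blending weights w_1^id ... w_{N_id - 1}^id.
    """
    b0 = bases_id[0]
    offsets = bases_id[1:] - b0                  # per-vertex deltas of each basis
    return b0 + np.tensordot(w_id, offsets, 1)   # E_N, shape (V, 3)

# toy usage: 29 shape bases over a 1000-vertex eyelid mesh
E_N = blend_shape_rig(np.random.rand(29, 1000, 3), np.random.rand(28))
```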

2.2 Pose Linear Rig

We build the pose rig by manually generating a set of bases to cover

all possible geometric changes caused by eyelid poses:

B^{exp} = \{ b_k^{exp} \mid k = 0, ..., N^{exp} - 1 \}, \quad N^{exp} = 23,   (3)

where b_0^{exp} is the same as b_0^{id}. Note that eyelid fold and bulge may

change with either shape or pose. Thus, there are some shared bases

for both shape and pose rigs to control the eyelid fold and bulge.

Fig. 6. Comparisons of different solutions on edge detection and identification. (a) input images; (b) results of four separate HEDs; (c) results of our network with separate loss defined in eq. (8); (d) results of our network with uniform loss defined in eq. (6); (e) ground truth. Note that as both (b) and (c) separately consider the four edges, they always detect all the four edges no matter whether or not the images contain fold edges or bulge edges.

In practice, we first use the shape rig to estimate E_N for a particular user. However, since b_0^{exp} is different from E_N, B^{exp} cannot be directly used to construct the pose rig for E_N. To overcome this drawback, we use deformation transfer [Sumner and Popović 2004] to recover B^{exp'} by transferring the deformation gradient between b_0^{exp} and b_i^{exp} to E_N. Correspondingly, the pose rig is used as:

E_P = b_0^{exp'} + \sum_{k=1}^{N^{exp}-1} w_k^{exp} (b_k^{exp'} - b_0^{exp'}).   (4)

Here, ' stands for the transferred bases.
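To make the offline/online split explicit, the sketch below builds a user-specific pose rig and then evaluates eq. (4). The deformation_transfer stand-in uses a crude per-vertex offset instead of the gradient-based method of Sumner and Popović [2004], and all names and array shapes are our placeholders.

```python
import numpy as np

def deformation_transfer(src_base, src_deformed, dst_base):
    """Placeholder for [Sumner and Popović 2004]: transfer the deformation
    src_base -> src_deformed onto dst_base. Approximated here by a per-vertex
    offset, which is far cruder than the gradient-based method."""
    return dst_base + (src_deformed - src_base)

def build_user_pose_rig(bases_exp, E_N):
    """Transfer every pose basis b_k^exp onto the user's neutral eyelid E_N,
    giving the transferred bases B^exp' used by eq. (4)."""
    b0_exp = bases_exp[0]
    return np.stack([deformation_transfer(b0_exp, b, E_N) for b in bases_exp])

def blend_pose_rig(bases_exp_t, w_exp):
    """Evaluate eq. (4): E_P = b_0^exp' + sum_k w_k^exp (b_k^exp' - b_0^exp')."""
    b0 = bases_exp_t[0]
    return b0 + np.tensordot(w_exp, bases_exp_t[1:] - b0, 1)

# offline: build the rig once per user; online: blend per-frame pose weights
bases_exp = np.random.rand(23, 1000, 3)     # dummy pose bases (23, V, 3)
E_N = np.random.rand(1000, 3)               # user's neutral eyelid from eq. (2)
E_P = blend_pose_rig(build_user_pose_rig(bases_exp, E_N), np.random.rand(22))
```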

3 EYELID EDGE DETECTION AND IDENTIFICATION

We discuss how to detect and identify the four edges for eyelid

motion capture. The four edges have different semantic meanings

representing the double-fold, the upper eyelid, the lower eyelid and

the lower boundary of the bulge, respectively, as shown in fig. 5.

For simplification, we name the four edges as fold edge, top edge,

bottom edge and bulge edge correspondingly. Note that the top edge

and the bottom edge always exist in all facial images (coincident

with each other when the eye is closed), but the fold edge and the

bulge edge may not exist for some eye shapes or poses.

Detecting and identifying the four edges is not a trivial task. Recently, DNN-based edge detection techniques have demonstrated

noticeable improvements in both detection accuracy and performance. However, in this paper, we need to not only detect but also

distinguish the four edges to perform eyelid fitting, which is not

considered in previous edge detection techniques. A naive extension

to identify the four edges is to train four edge detectors, each of

which is individually trained to detect only one of the four edges.

However, for the four edges of each eye region, their relative positions, shapes, and motions are highly correlated. Training four

detectors separately ignores those correlations and does not achieve

good results, as shown in fig. 6(b).

3.1 Network

To exploit the correlation among the four edges, we modify the

holistically-nested edge detection (HED) proposed by Xie et al. [2015]

to train a uniform network that jointly detects and identifies the

four edges. We first formulate the edge detection and identification problem as a multi-channel edge detection problem and then

propose a unified energy metric to jointly consider the four edges

together in the network training, which learns the correlations of

the four edges.


The HED is based on the VGG-16 net [Simonyan and Zisserman 2014]. It connects a side-output layer to each of the five stages of the VGG-16 net. Each side-output layer generates an edge detection result

from the VGG features in the corresponding stage. Each result is

supervised by the ground-truth edge map, and all results are fused

together to generate the final output, which is also supervised by

the ground truth. In our system, since identifying the four edges

is required, we represent our output and also the ground truth as

four-channel binary edge maps, where 1 in each channel stands

for pixels on one of the four edges and 0 is used to label other

pixels. Compared with the network in the original HED, we use

four convolution kernels, each corresponding to one channel in each

side-output layer and also in the final fused layer. In this manner, we

create a modified network that fits the representation of the output

and jointly achieves detection and identification of the four edges.
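As a rough sketch of this modification, the snippet below shows one way four-channel side outputs and a fused layer could sit on top of five backbone feature maps. It is a PyTorch approximation for illustration only; the layer shapes, the upsampling choice, and the name MultiChannelHEDHead are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiChannelHEDHead(nn.Module):
    """Four-channel side outputs plus a fused four-channel output,
    one channel per semantic edge (fold, top, bottom, bulge)."""

    def __init__(self, feat_channels=(64, 128, 256, 512, 512), num_edges=4):
        super().__init__()
        # one 1x1 conv per backbone stage, each predicting all four edge maps
        self.side = nn.ModuleList(nn.Conv2d(c, num_edges, 1) for c in feat_channels)
        # fuse the upsampled side predictions into the final four-channel map
        self.fuse = nn.Conv2d(num_edges * len(feat_channels), num_edges, 1)

    def forward(self, feats, out_size):
        # feats: list of five feature maps, one per VGG-16 stage
        sides = [F.interpolate(conv(f), size=out_size, mode='bilinear',
                               align_corners=False)
                 for conv, f in zip(self.side, feats)]
        fused = self.fuse(torch.cat(sides, dim=1))
        return sides, fused  # raw activations; a per-channel sigmoid gives Pr

# toy usage with dummy feature maps
feats = [torch.randn(1, c, 32, 32) for c in (64, 128, 256, 512, 512)]
sides, fused = MultiChannelHEDHead()(feats, out_size=(64, 64))
```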

3.2 Loss

To train our network to learn optimal network parameters θ, we formulate the loss function as follows:

arg min_θ ( α_f L(ψ_f(I, θ), G) + \sum_{k=1}^{M} α_s^k L(ψ_s^k(I, θ), G) ).   (5)

In this function, I and G are the input eye region image and the

four-channel ground truth edge map. ψ_f and ψ_s^k stand for the fused

output and the side-outputs of the network. The loss function L

computes the pixel-wise sigmoid cross-entropy loss between an

output edge map and the ground truth. M represents the number of

side-outputs in the network (5 in our experiments). α_f and α_s^k are

the weight parameters for the fused loss and the side losses.

We define a unified loss L that integrates the detection and identification errors of the four edges together:

L(ψ(I, θ), G) = -β \sum_{j ∈ G_+} log Pr(y_j = 0 | I; θ) - (1 - β) \sum_{i=1}^{4} \sum_{j ∈ G_-^i} log Pr(y_j^i = 1 | I; θ),   (6)

where

β = \sum_{i=1}^{4} |G_-^i| / |G|.   (7)

G_-^i is the set of pixels belonging to the ith edge in the ground truth, and thus β stands for the ratio of the number of edge pixels in the ground truth to the number of all pixels in the eye region image. G_+ is the union of all four sets of non-edge pixels in the ground truth, where each set is denoted G_+^i. Pr(y_j^i = 1 | I; θ) stands for the probability of pixel j belonging to the ith edge in an output, while Pr(y_j = 0 | I; θ) stands for the probability of belonging to non-edges. Pr = σ(a_j), where σ is the sigmoid function and a_j is the activation value of pixel j in our DNN. With the definition of eq. (6), minimizing eq. (5) leads to a network generating the desired output. One example of the input, ground truth, and the output of our network is shown in fig. 6.

There is a more straightforward way to define L by the individual detection errors of the four edges:

L(ψ(I, θ), G) = \sum_{i=1}^{4} ( -β^i \sum_{j ∈ G_+^i} log Pr(y_j^i = 0 | I; θ) - (1 - β^i) \sum_{j ∈ G_-^i} log Pr(y_j^i = 1 | I; θ) ),   (8)

where

β^i = |G_-^i| / |G|.   (9)

However, in this definition, the four edges independently contribute to different terms, thus their correlations are not considered in the loss. As a consequence, compared with eq. (6), eq. (8) generates incorrect results as shown in fig. 6(c).
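To show how eq. (6) balances the two pixel classes, here is a small PyTorch sketch. The paper leaves the exact form of Pr(y_j = 0 | I; θ) for multi-channel output implicit, so treating a non-edge pixel as one that should score low on every channel is our assumption, and the name unified_edge_loss is ours.

```python
import torch

def unified_edge_loss(logits, gt, eps=1e-6):
    """Sketch of the unified class-balanced loss in the spirit of eq. (6).

    logits : (B, 4, H, W) raw activations a_j, one channel per semantic edge.
    gt     : (B, 4, H, W) binary ground-truth edge maps.
    Assumption (not spelled out in the paper): a pixel lying on none of the
    four edges should receive a low probability on every channel.
    """
    prob = torch.sigmoid(logits)                    # Pr(y_j^i = 1 | I; theta)
    # beta: ratio of edge pixels to all pixels in the eye-region image (eq. 7)
    beta = gt.sum() / (gt.shape[0] * gt.shape[2] * gt.shape[3])
    non_edge = (gt.sum(dim=1, keepdim=True) == 0).float()    # indicator of G_+
    edge_term = -(gt * torch.log(prob + eps)).sum()          # pixels in G_-^i
    non_edge_term = -(non_edge * torch.log(1.0 - prob + eps)).sum()
    return beta * non_edge_term + (1.0 - beta) * edge_term
```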

3.3 Training

Our training set contains 194 eye region images from 48 identities

and the corresponding four-channel ground truth edge maps (manually labeled). The images of 5 identities are recorded by ourselves

and another 3 are from the Eyediap database [Mora et al. 2014].

These 8 identities contribute 57 facial images, each of which provides two eye region images. For each identity, our data set contains

images with different eyelid poses. The pose changes are caused by

eyelid motions, as well as gaze motions, because eyelids change with

eyeball movements. The remaining 40 identities are collected from

the Internet, each of which has one image. These have no eyelid

pose change, and they contribute 80 training samples in total. Note

that the fold edge and the bulge edge do not exist in some images:

some subjects may not have double eyelids or eye bulges; the fold edge may disappear when subjects look downwards or close their

eyes. In these cases, we do not label fold edges or bulge edges.

In training, the network is fine-tuned from an initialization of the pre-trained VGG-16 net model using a standard stochastic gradient descent algorithm. The training parameters are set as follows: learning rate (1e-6), momentum (0.9), weight decay (0.0002). The weight parameters α_f and α_s^k are all set to 1.
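Purely to illustrate how these hyperparameters slot into a standard SGD setup (the paper does not name a framework), a minimal sketch that reuses the unified_edge_loss function from the sketch above and a stand-in model:

```python
import torch
from torch import nn

# stand-in for VGG-16 plus the multi-channel head sketched in section 3.1
model = nn.Conv2d(3, 4, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6,
                            momentum=0.9, weight_decay=2e-4)

image = torch.randn(1, 3, 64, 128)                 # dummy eye-region crop
gt = (torch.rand(1, 4, 64, 128) > 0.95).float()    # dummy 4-channel edge map

# eq. (5) with alpha_f = alpha_s^k = 1 would add the side losses analogously
loss = unified_edge_loss(model(image), gt)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```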

4 CURVE-BASED EYELID RECONSTRUCTION

We discuss how to reconstruct the 3D eyelids of a user with our

linear eyelid models and the four edges identified on each recorded

frame. The core idea is to estimate the optimal shape and pose

weights in our linear eyelid models, aiming to minimize the inconsistency between the projected eyelid edges and the real eyelid

edges on the image. To construct the optimization problem, we need

to define the correspondences between the 3D eyelid model and

2D pixels. The edges of the eyelid model are defined by a set of

manually-labeled mesh vertices, called 3D eyelid landmarks. Their

corresponding pixels, called 2D eyelid landmarks, are extracted by

first fitting four polynomial curves to the detected edges on the

edge map (section 4.1), and then determining their locations on the

curves (section 4.2). With the correspondences, we minimize an

energy function that measures the distances between the projected

3D landmarks and their corresponding 2D landmarks (section 4.3).
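As an illustration of the 2D side of this correspondence construction, one detected edge channel can be reduced to a polynomial curve and sampled at fixed positions to obtain candidate 2D landmarks. The degree, threshold, and sampling below are our placeholders rather than the paper's actual choices (its curve fitting and correspondence search appear in sections 4.1 and 4.2).

```python
import numpy as np

def fit_edge_curve(edge_prob, threshold=0.5, degree=3):
    """Fit y = poly(x) to the pixels detected for one semantic edge channel."""
    ys, xs = np.nonzero(edge_prob > threshold)        # (row, col) edge pixels
    return np.polynomial.Polynomial.fit(xs, ys, degree)

def sample_landmarks(curve, x_positions):
    """2D eyelid landmarks: points on the fitted curve at chosen x locations."""
    return np.stack([x_positions, curve(x_positions)], axis=1)

# toy usage: a synthetic parabolic edge in a 60x120 probability map
edge = np.zeros((60, 120))
xs = np.arange(10, 110)
edge[(0.002 * (xs - 60) ** 2 + 20).astype(int), xs] = 1.0
landmarks = sample_landmarks(fit_edge_curve(edge), np.linspace(10, 110, 9))
```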

Note that our eyelid reconstruction method is integrated into a

real-time face tracking system [Wen et al. 2016], which reconstructs
