
Face Transfer with Multilinear Models


TL;DR

  • Multilinear Models
  • Formula — See Sec. 4
  • Building
    • Input — Motion Capture ( vertices $\times$ ( visemes $\times$ ) expressions $\times$ identities )
    • Correspondence — Template Mesh → Scans ( See Sec. 4.2 )
    • Missing Data — See Sec. 4.4
    • Method — SVD
  • Face Transfer
    • Input — Video Data
    • Face Tracking ( See Sec. 5.1 )
    • Initialization

Abstract

Face Transfer is a method for mapping video-recorded performances of one individual to facial animations of another. It extracts visemes (speech-related mouth articulations), expressions, and three-dimensional (3D) pose from monocular video or film footage. These parameters are then used to generate and drive a detailed 3D textured face mesh for a target identity, which can be seamlessly rendered back into target footage. The underlying face model automatically adjusts for how the target performs facial expressions and visemes. The performance data can be easily edited to change the visemes, expressions, pose, or even the identity of the target—the attributes are separably controllable. This supports a wide variety of video rewrite and puppetry applications.

Face Transfer is based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes (e.g., identity, expression, and viseme). Separability means that each of these attributes can be independently varied. A multilinear model can be estimated from a Cartesian product of examples (identities ×\times expressions ×\times visemes) with techniques from statistical analysis, but only after careful preprocessing of the geometric data set to secure one-to-one correspondence, to minimize cross-coupling artifacts, and to fill in any missing examples. Face Transfer offers new solutions to these problems and links the estimated model with a face-tracking algorithm to extract pose, expression, and viseme parameters.

1. Introduction

  • multilinear model decouples the three attributes
  • separability & consistency

3. Multilinear Algebra

Tensors

$$\mathscr{T} = \pqty{t_{i_1 i_2 \dots i_N}}$$

Mode Spaces

The mode-$n$ vectors of a tensor are its fibers along the $n$-th index ( vary $i_n$, fix all other indices ); they span the tensor's mode-$n$ space.

Mode-$n$ Product

  • $\mathscr{T}$ — tensor
  • $\vb{M}$ — matrix

$\mathscr{T} \cp_n \vb{M}$ indicates a linear transformation of vectors in $\mathscr{T}$'s mode-$n$ space by the matrix $\vb{M}$. For example, $\mathscr{T} \cp_2 \vb{M}$ replaces each mode-2 vector $\vb{v}$ with the transformed vector $\vb{M} \vb{v}$.
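
A minimal numpy sketch of the mode-$n$ product ( the helper names `unfold`, `fold`, and `mode_n_product` are mine, not from the paper ):

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding: arrange the mode-n vectors of T as columns."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: restore the original axis order."""
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def mode_n_product(T, M, n):
    """T x_n M: transform every mode-n vector of T by the matrix M."""
    shape = list(T.shape)
    shape[n] = M.shape[0]
    return fold(M @ unfold(T, n), n, shape)
```

For example, if `T` has shape `(3, 4, 5)` and `M` has shape `(7, 4)`, then `mode_n_product(T, M, 1)` has shape `(3, 7, 5)`.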

Tensor Decomposition

$$\begin{align*} & \mathscr{T} \cp_1 \vb{U}_1^{\top} \cp_2 \vb{U}_2^{\top} \cp_3 \vb{U}_3^{\top} \dots \cp_N \vb{U}_N^{\top} = \mathscr{C} \\ \Longrightarrow \quad & \mathscr{T} = \mathscr{C} \cp_1 \vb{U}_1 \cp_2 \vb{U}_2 \cp_3 \vb{U}_3 \dots \cp_N \vb{U}_N \end{align*}$$

  • $\mathscr{T}$ — data tensor
  • $\mathscr{C}$ — core tensor
  • $\vb{U}_i$ — orthonormal matrix whose columns contain the left singular vectors of the $i$-th mode space; it can be computed via SVD

$$\mathscr{T} \simeq \mathscr{C}_{\text{reduced}} \cp_1 \check{\vb{U}}_1 \cp_2 \check{\vb{U}}_2 \cp_3 \check{\vb{U}}_3 \dots \cp_N \check{\vb{U}}_N$$

  • $\check{\vb{U}}_i$ — truncated versions of $\vb{U}_i$ with the last few columns removed ( see the sketch below )
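
A sketch of the truncated $N$-mode SVD ( HOSVD ) described above, reusing `unfold` and `mode_n_product` from the previous sketch; the choice of per-mode ranks is left to the caller:

```python
import numpy as np

def hosvd(T, ranks):
    """Truncated N-mode SVD: per-mode bases from the SVD of each
    unfolding, then the core tensor by projecting T onto them."""
    Us = []
    for n, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, n), full_matrices=False)
        Us.append(U[:, :r])                # truncated U-check matrix
    C = T
    for n, U in enumerate(Us):
        C = mode_n_product(C, U.T, n)      # C = T x_1 U1^T ... x_N UN^T
    return C, Us

def reconstruct(C, Us):
    """T ~= C x_1 U1 x_2 U2 ... x_N UN."""
    T = C
    for n, U in enumerate(Us):
        T = mode_n_product(T, U, n)
    return T
```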

4. Multilinear Face Model

4.1. Face Data

Bilinear model

vertices $\times$ expressions $\times$ identities

Trilinear model

vertices $\times$ visemes $\times$ expressions $\times$ identities
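
To make the Cartesian-product layout concrete, a sketch of assembling the trilinear data tensor ( the dimensions are illustrative, `load_registered_scan` is a hypothetical loader returning a mesh already put in correspondence and flattened to a $3V$ vector, and `hosvd` is the sketch from Sec. 3 ):

```python
import numpy as np

# Illustrative dimensions only, not the paper's actual counts.
V, n_vis, n_exp, n_id = 1000, 5, 5, 16

# Trilinear layout: vertices x visemes x expressions x identities.
T = np.zeros((3 * V, n_vis, n_exp, n_id))
for v in range(n_vis):
    for e in range(n_exp):
        for i in range(n_id):
            T[:, v, e, i] = load_registered_scan(v, e, i)  # hypothetical loader

# Keep the vertex mode; truncate the attribute modes as desired.
C, Us = hosvd(T, ranks=(3 * V, n_vis, n_exp, 8))
```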

4.2. Correspondence

42 manually specified reference points

  1. align the template and the scan
  2. deform the template mesh into the scan
    1. at first, weighting the marked correspondences heavily
    2. afterward, emphasizing vertex proximity ( see the sketch below )
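
A toy sketch of the shifting-weight idea ( the function name and weight schedule are mine; a real fit would also regularize the deformation for smoothness ):

```python
import numpy as np
from scipy.spatial import cKDTree

def fit_template(template, scan_pts, marked_idx, marked_dst, n_iters=10):
    """Early iterations trust the manual markers; later iterations
    trust closest-point proximity to the scan surface."""
    X = template.copy()                          # (V, 3) template vertices
    tree = cKDTree(scan_pts)                     # (S, 3) scan samples
    for it in range(n_iters):
        alpha = 1.0 - it / max(n_iters - 1, 1)   # marker weight: 1 -> 0
        _, nn = tree.query(X)                    # closest scan point per vertex
        goal = scan_pts[nn]                      # proximity targets
        goal[marked_idx] = alpha * marked_dst + (1 - alpha) * goal[marked_idx]
        X += 0.5 * (goal - X)                    # damped step toward targets
    return X
```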

4.3. Face Model

\begin{equation}
\mathscr{T} \simeq \mathscr{M} \cp_2 \check{\vb{U}}_2 \cp_3 \check{\vb{U}}_3 \dots \cp_N \check{\vb{U}}_N
\label{eq:4}
\end{equation}

$$\vb{f} = \mathscr{M} \cp_2 \vb{w}_2^{\top} \cp_3 \vb{w}_3^{\top} \dots \cp_N \vb{w}_N^{\top}$$

  • $\mathscr{M}$ — multilinear model of face geometry
  • $\vb{w}_i$ — a column vector of parameters ( weights ) for the attribute corresponding to the $i$-th mode
  • $\vb{f}$ — a column vector of vertices describing the resulting face ( see the sketch below )
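
Synthesizing a face from the model is then a sequence of mode products with the per-attribute weight vectors; a minimal sketch reusing `mode_n_product` from Sec. 3 ( `weights` holds $\vb{w}_2, \dots, \vb{w}_N$ in mode order ):

```python
import numpy as np

def synthesize_face(M, weights):
    """f = M x_2 w2^T x_3 w3^T ... x_N wN^T.
    Mode 1 (vertices) is untouched; the result is a 3V vector."""
    f = M
    for n, w in enumerate(weights, start=1):   # axes 1..N-1 = modes 2..N
        f = mode_n_product(f, w[None, :], n)   # w^T as a 1-row matrix
    return f.ravel()
```

For the trilinear model this is `synthesize_face(M, [w_vis, w_exp, w_id])`.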

4.4. Missing Data

Description

  1. For each mode, we assemble an incomplete matrix whose columns are the corresponding mode vectors, then seek a subspace decomposition that best reconstructs the known values of that matrix ( PPCA ).
  2. Because the missing elements are shared across the mode vectors of all modes, the linear constraints must be combined: we collect the linear equations that determine a particular missing value in every mode and solve them together, so the filled-in values are consistent.

Evaluation

Probabilistic Interpretation

$$\mathscr{T} = \mathscr{M} \cp_2 \check{\vb{U}}_2 \cp_3 \check{\vb{U}}_3 \dots \cp_N \check{\vb{U}}_N + \nu$$

  • $\mathscr{T}, \mathscr{M}$ — data and model tensors
  • $\check{\vb{U}}_i$ — the $i$-th modal subspace
  • $\nu$ — a Gaussian noise source

$$p\pqty{\mathscr{T} \vert \mathscr{M}, \Bqty{\check{\vb{U}}_i}_{i=2}^N} \approx \prod_{j=2}^N q_j\pqty{\mathscr{T}, \mathscr{M}, \Bqty{\check{\vb{U}}_i}_{i=2}^N}^{1/N}$$

Each Gaussian $q_j\pqty{\mathscr{T}, \mathscr{M}, \Bqty{\check{\vb{U}}_i}_{i=2}^N} \doteq \mathscr{N}\pqty{\mathscr{T} \vert \check{\vb{U}}_j \vb{J}_j, \sigma_j^2}$ is found by fixing $\Bqty{\check{\vb{U}}_i}_{i \neq j}$ and turning the tensor Equation $\eqref{eq:4}$ into matrix form: $\vb{T}_j = \check{\vb{U}}_j \vb{J}_j$. Here, the columns of $\vb{T}_j$ are the mode-$j$ vectors of $\mathscr{T}$, and the columns of $\vb{J}_j$ are the mode-$j$ vectors of $\mathscr{M} \cp_2 \check{\vb{U}}_2 \dots \cp_{j-1} \check{\vb{U}}_{j-1} \cp_{j+1} \check{\vb{U}}_{j+1} \dots \cp_N \check{\vb{U}}_N$. The resulting likelihood becomes:

$$p\pqty{\mathscr{T} \vert \mathscr{M}, \Bqty{\check{\vb{U}}_i}_{i=2}^N} \approx \prod_{j=2}^N \mathscr{N}\pqty{\mathscr{T} \vert \check{\vb{U}}_j \vb{J}_j, \sigma_j^2}^{1/N}$$

Taking logarithms and discarding constant factors, we seek to minimize the sum-squared error

$$\sum_{j=2}^N \norm{\vb{T}_j - \check{\vb{U}}_j \vb{J}_j}_F^2$$

Each term of the summation presents a matrix factorization problem with missing values, where $\check{\vb{U}}_j$ and $\vb{J}_j$ are treated as unknown factors of the incomplete matrix $\vb{T}_j$, and are solved for using PPCA.
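
As a concrete stand-in for that step, a minimal masked alternating-least-squares factorization of an incomplete matrix ( PPCA additionally models the noise variance and treats the factors probabilistically; this sketch keeps only the reconstruct-the-known-values objective ):

```python
import numpy as np

def factor_incomplete(T, mask, rank, n_iters=50, seed=0):
    """Factor T ~= U @ J using only entries where mask is True."""
    m, n = T.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, rank))
    J = rng.standard_normal((rank, n))
    for _ in range(n_iters):
        for c in range(n):                       # refit each column of J
            obs = mask[:, c]
            J[:, c] = np.linalg.lstsq(U[obs], T[obs, c], rcond=None)[0]
        for r in range(m):                       # refit each row of U
            obs = mask[r, :]
            U[r] = np.linalg.lstsq(J[:, obs].T, T[r, obs], rcond=None)[0]
    return U, J
```

The missing entries of $\vb{T}_j$ are then read off from the reconstruction `U @ J`.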

5. Face Transfer

5.1. Face Tracking

optical flow + weak-perspective camera model

Using the symmetric Kanade-Lucas-Tomasi formulation, we express the frame-to-frame motion of a tracked point with a linear system:

$$\vb{Z} \vb{d} = \vb{Z} \pqty{\vb{p} - \vb{p}_0} = \vb{e}$$

  • $\vb{d}$ — describes the image-space motion of the point
  • $\vb{p}$ — the point's true location
  • $\vb{p}_0$ — its current best guess ( the location from the previous frame if there is no better guess )
  • matrix $\vb{Z}$, vector $\vb{e}$ — contain spatial and temporal intensity-gradient information from the surrounding region ( see the sketch below )
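
A sketch of how $\vb{Z}$ and $\vb{e}$ would be accumulated for one point, following the standard KLT construction ( the window size and the exact gradient averaging are my choices, not taken from the paper ):

```python
import numpy as np

def klt_system(I0, I1, x, y, half=7):
    """Accumulate Z (2x2) and e (2,) from intensity gradients in a
    window around integer position (x, y) of frames I0 and I1."""
    ys, xs = np.mgrid[y - half:y + half + 1, x - half:x + half + 1]
    gy, gx = np.gradient((I0 + I1) / 2.0)        # averaged spatial gradients
    g = np.stack([gx[ys, xs].ravel(), gy[ys, xs].ravel()])  # 2 x W^2
    Z = g @ g.T                                  # structure tensor
    e = g @ (I0[ys, xs] - I1[ys, xs]).ravel()    # temporal term
    return Z, e

# per-point displacement: d = Z^{-1} e
# d = np.linalg.solve(Z, e)
```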

Using a weak-perspective imaging model, the point position \vbp\vb{p} can be expanded in terms of rigid head-motion parameters and non-rigid facial shape parameters, which are constrained by the multilinear model:

\begin{equation}
\vb{Z} \pqty{s \vb{R} \vb{f}_i + \vb{t} - \vb{p}_0} = \vb{e}
\label{eq:7}
\end{equation}

  • $s$ — scale factor
  • $\vb{R}$ — 3D rotation matrix
  • $\vb{t}$ — image-space translation
  • $\vb{f}_i$ — the $i$-th 3D vertex being tracked

Solving for the pose and all the multilinear weights from a pair of frames using Equation $\eqref{eq:7}$ is not a well-constrained problem. To simplify the computation, we use a coordinate-descent method: we let only one of the face attributes vary at a time by fixing all the others to their current guesses. This transforms the multilinear problem into a linear one.

$$\vb{Z} \pqty{s \vb{R} \vb{M}_{m,i} \vb{w}_m + \vb{t} - \vb{p}_0} = \vb{e}$$

  • $m$ — the mode corresponding to the non-fixed attribute

  • $\vb{w}_m$ — a vector of weights for that attribute

  • $\vb{M}_{m,i}$ — the corresponding linear basis for the tracked vertex $i$, obtained from

    $$\vb{M}_m = \mathscr{M} \cp_2 \vb{w}_2^{\top} \dots \cp_{m-1} \vb{w}_{m-1}^{\top} \cp_{m+1} \vb{w}_{m+1}^{\top} \dots \cp_N \vb{w}_N^{\top}$$

  • If the currently tracked attribute varies from frame to frame ( e.g., expression ), we solve the set of linear systems and proceed to the next pair of neighboring frames.

  • If the attribute is constant across all frames ( e.g., identity ), we accumulate these linear systems over all frame pairs and solve them together as one combined system ( a sketch of the per-attribute solve follows below ).
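
A sketch of one coordinate-descent step ( reusing `mode_n_product` from Sec. 3; the axis convention for `m`, the `weights` list of current guesses, and the `tracks` container of per-point `(i, Z, e, p0)` tuples are my assumptions ):

```python
import numpy as np

def mode_basis(M, weights, m):
    """Contract the model with every fixed attribute's weights,
    leaving the vertex mode and the free mode m: shape (3V, dim_m)."""
    B = M
    for n, w in enumerate(weights, start=1):
        if n != m:
            B = mode_n_product(B, w[None, :], n)
    return B.reshape(B.shape[0], -1)

def solve_attribute(M, weights, m, tracks, s, R, t):
    """Stack Z (s R M_{m,i} w_m + t - p0) = e over all tracked points
    and solve for w_m in least squares."""
    P = (s * R)[:2]                        # weak-perspective projection rows
    Bm = mode_basis(M, weights, m)
    A, b = [], []
    for i, Z, e, p0 in tracks:
        Mi = Bm[3 * i:3 * i + 3]           # 3 x dim_m basis of vertex i
        A.append(Z @ P @ Mi)
        b.append(e - Z @ (t - p0))
    return np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
```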

5.2. Initialization

  • specify a small number of feature points which are then used to position the face geometry
  • correspondences: user-provided or automatically detected ( see the sketch below )
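
A sketch of recovering the weak-perspective pose from such feature correspondences ( a standard Procrustes-style least-squares fit, not necessarily the paper's exact initialization; `X3d` and `x2d` are assumed matched model vertices and image points ):

```python
import numpy as np

def weak_perspective_pose(X3d, x2d):
    """Estimate s, R, t with x2d ~= s * (R @ X)[:2] + t."""
    Xc = X3d - X3d.mean(axis=0)                  # centered 3D points (K, 3)
    xc = x2d - x2d.mean(axis=0)                  # centered 2D points (K, 2)
    A, *_ = np.linalg.lstsq(Xc, xc, rcond=None)  # xc ~= Xc @ A, A is 3x2
    A = A.T                                      # 2 x 3 scaled rotation rows
    n1, n2 = np.linalg.norm(A[0]), np.linalg.norm(A[1])
    s = np.sqrt(n1 * n2)                         # geometric-mean scale
    r1, r2 = A[0] / n1, A[1] / n2                # should be re-orthogonalized
    R = np.vstack([r1, r2, np.cross(r1, r2)])    # approximate rotation
    t = x2d.mean(axis=0) - s * (R @ X3d.mean(axis=0))[:2]
    return s, R, t
```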

8. Conclusion

  • estimate a highly detailed face model from an incomplete set of face scans
  • multilinear model
  • separability: different attributes, such as identity and expression, can be manipulated independently