
Learning a model of facial shape and expression from 4D scans


3. Model formulation

FLAME is described by a function:

$$
M(\vec{\beta}, \vec{\theta}, \vec{\psi}): \mathbb{R}^{|\vec{\beta}| \times |\vec{\theta}| \times |\vec{\psi}|} \to \mathbb{R}^{3N}
$$

  • $\vec{\beta}$ — coefficients describing shape
  • $\vec{\theta}$ — coefficients describing pose
  • $\vec{\psi}$ — coefficients describing expression
  • $\overline{\mathbf{T}} \in \mathbb{R}^{3N}$ — template mesh in the “zero pose” $\vec{\theta}^*$
  • $\vec{\theta}^*$ — “zero pose”
  • $B_S(\vec{\beta}; \mathcal{S}): \mathbb{R}^{|\vec{\beta}|} \to \mathbb{R}^{3N}$ — shape blendshape function, accounting for identity-related shape variation
  • $B_P(\vec{\theta}; \mathcal{P}): \mathbb{R}^{|\vec{\theta}|} \to \mathbb{R}^{3N}$ — corrective pose blendshapes, correcting pose deformations that cannot be explained solely by LBS
  • $B_E(\vec{\psi}; \mathcal{E}): \mathbb{R}^{|\vec{\psi}|} \to \mathbb{R}^{3N}$ — expression blendshapes, capturing facial expressions
  • $W(\overline{\mathbf{T}}, \mathbf{J}, \vec{\theta}, \mathcal{W})$ — a standard skinning function that rotates the vertices of $\overline{\mathbf{T}}$ around joints $\mathbf{J} \in \mathbb{R}^{3K}$, linearly smoothed by blendweights $\mathcal{W} \in \mathbb{R}^{K \times N}$

\begin{equation}
M(\vec{\beta}, \vec{\theta}, \vec{\psi}) = W(T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}), \mathbf{J}(\vec{\beta}), \vec{\theta}, \mathcal{W})
\label{eq:1}
\end{equation}

$$
T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}) = \overline{\mathbf{T}} + B_S(\vec{\beta}; \mathcal{S}) + B_P(\vec{\theta}; \mathcal{P}) + B_E(\vec{\psi}; \mathcal{E})
$$

$$
\mathbf{J}(\vec{\beta}; \mathscr{T}, \overline{\mathbf{T}}, \mathcal{S}) = \mathscr{T}(\overline{\mathbf{T}} + B_S(\vec{\beta}; \mathcal{S}))
$$

  • $\mathscr{T}$ — a sparse matrix defining how to compute joint locations from mesh vertices
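As a concrete sketch of the formulas above, assuming flattened vertex arrays and dense matrices for the bases $\mathcal{S}$, $\mathcal{E}$ and the regressor $\mathscr{T}$ (the sizes below are placeholders; FLAME itself uses $N = 5023$ vertices and $K = 4$ joints), the template deformation $T_P$ and joint regression $\mathbf{J}(\vec{\beta})$ reduce to matrix-vector products:

```python
import numpy as np

# Hypothetical sizes for illustration only.
N, K, n_shape, n_expr = 100, 4, 10, 5

rng = np.random.default_rng(0)
T_bar = rng.standard_normal(3 * N)            # template mesh, flattened (x, y, z)
S = rng.standard_normal((3 * N, n_shape))     # shape basis S
E = rng.standard_normal((3 * N, n_expr))      # expression basis E
J_reg = rng.standard_normal((K, N))           # joint regressor (sparse in the paper)

def template_offsets(beta, psi, B_P):
    """T_P = T_bar + B_S(beta) + B_P(theta) + B_E(psi); the shape and
    expression blendshapes are plain matrix-vector products."""
    return T_bar + S @ beta + B_P + E @ psi

def joints(beta):
    """J(beta): apply the regressor to the shaped (but unposed) template."""
    shaped = (T_bar + S @ beta).reshape(N, 3)
    return J_reg @ shaped                      # (K, 3) joint locations
```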

Shape blendshapes

$$
B_S(\vec{\beta}; \mathcal{S}) = \sum_{n=1}^{|\vec{\beta}|} \beta_n \mathbf{S}_n
$$

  • $\vec{\beta} = [\beta_1, \dots, \beta_{|\vec{\beta}|}]^T$ — shape coefficients
  • $\mathcal{S} = [\mathbf{S}_1, \dots, \mathbf{S}_{|\vec{\beta}|}] \in \mathbb{R}^{3N \times |\vec{\beta}|}$ — orthonormal shape basis, learned below with PCA

Pose blendshapes

$$
B_P(\vec{\theta}; \mathcal{P}) = \sum_{n=1}^{9K} (R_n(\vec{\theta}) - R_n(\vec{\theta}^*)) \mathbf{P}_n
$$

  • $R(\vec{\theta}): \mathbb{R}^{|\vec{\theta}|} \to \mathbb{R}^{9K}$ — a function from a face / head / eye pose vector $\vec{\theta}$ to a vector containing the concatenated elements of all the corresponding rotation matrices
  • $R_n(\vec{\theta}), R_n(\vec{\theta}^*)$ — $n$-th elements of $R(\vec{\theta})$ and $R(\vec{\theta}^*)$
  • $\mathbf{P}_n \in \mathbb{R}^{3N}$ — vertex offsets from the rest pose activated by $R_n$
  • $\mathcal{P} = [\mathbf{P}_1, \dots, \mathbf{P}_{9K}] \in \mathbb{R}^{3N \times 9K}$ — pose space, a matrix containing all pose blendshapes
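The pose feature $R(\vec{\theta}) - R(\vec{\theta}^*)$ can be sketched with SciPy's rotation utilities; since the zero pose yields identity rotations, the rest-pose term reduces to subtracting $I_3$ (a sketch under that assumption, not the authors' code):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_feature(theta):
    """R(theta) - R(theta*): concatenated entries of each joint's
    rotation matrix, minus the rest pose (identity for theta* = 0)."""
    # theta: (K, 3) axis-angle rotation vectors, one row per joint
    R = Rotation.from_rotvec(theta).as_matrix()   # (K, 3, 3)
    return (R - np.eye(3)).reshape(-1)            # (9K,) vector

def pose_blendshapes(theta, P):
    """B_P(theta; P) = sum_n (R_n(theta) - R_n(theta*)) P_n,
    i.e. a matrix-vector product with the (3N, 9K) pose space P."""
    return P @ pose_feature(theta)
```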

Expression blendshapes

\begin{equation}
B_E(\vec{\psi}; \mathcal{E}) = \sum_{n=1}^{|\vec{\psi}|} \psi_n \mathbf{E}_n
\label{eq:5}
\end{equation}

  • $\vec{\psi} = [\psi_1, \dots, \psi_{|\vec{\psi}|}]^T$ — expression coefficients
  • $\mathcal{E} = [\mathbf{E}_1, \dots, \mathbf{E}_{|\vec{\psi}|}] \in \mathbb{R}^{3N \times |\vec{\psi}|}$ — orthonormal expression basis

Template shape

4. Temporal registration

4.1. Initial model

Shape

Pose

Expression

4.2. Single-frame registration

Model-only

estimate the model coefficients $\{\vec{\beta}, \vec{\theta}, \vec{\psi}\}$ by optimizing

$$
E(\vec{\beta}, \vec{\theta}, \vec{\psi}) = E_D + \lambda_L E_L + E_P
$$

$$
E_D = \lambda_D \sum_{\mathbf{v}_s} \rho\left(\min_{\mathbf{v}_m \in M(\vec{\beta}, \vec{\theta}, \vec{\psi})} \|\mathbf{v}_s - \mathbf{v}_m\|\right)
$$

  • $E_D$ — measures the scan-to-mesh distance between the scan vertices $\mathbf{v}_s$ and the closest points on the surface of the model
  • $\mathbf{v}_s$ — scan vertices
  • $\lambda_D$ — weight controlling the influence of the data term
  • $\rho$ — a Geman-McClure robust penalty function
  • $E_L$ — a landmark term, measuring the L2-norm distance between image landmarks and corresponding vertices on the model template, projected into the image using the known camera calibration

$$
E_P = \lambda_{\vec{\theta}} E_{\vec{\theta}} + \lambda_{\vec{\beta}} E_{\vec{\beta}} + \lambda_{\vec{\psi}} E_{\vec{\psi}}
$$

  • $E_P$ — regularizes the pose coefficients $\vec{\theta}$, shape coefficients $\vec{\beta}$, and expression coefficients $\vec{\psi}$ to be close to zero by penalizing their squared values
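A minimal sketch of the robust data term, assuming the common Geman-McClure form $\rho(r) = r^2 / (r^2 + \sigma^2)$ and a brute-force nearest-vertex search in place of the true closest surface point (the scale $\sigma$ is not given in the notes; it is a placeholder here):

```python
import numpy as np

def geman_mcclure(r, sigma):
    """Robust penalty: quadratic for small residuals, saturating
    toward 1 for outliers, so noisy scan points cannot dominate."""
    r2 = r * r
    return r2 / (r2 + sigma * sigma)

def data_term(scan_v, model_v, lambda_D=1.0, sigma=1.0):
    """E_D: robust sum over scan vertices of the distance to the
    closest model vertex (nearest vertex stands in for the true
    closest point on the model surface)."""
    # (S, M) pairwise distances; a k-d tree would be used in practice
    d = np.linalg.norm(scan_v[:, None, :] - model_v[None, :, :], axis=-1)
    return lambda_D * geman_mcclure(d.min(axis=1), sigma).sum()
```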

Coupled

allow the optimization to leave the model space by optimizing

\begin{equation}
E(\mathbf{T}, \vec{\beta}, \vec{\theta}, \vec{\psi}) = E_D + E_C + E_R + E_P
\label{eq:9}
\end{equation}

  • $\mathbf{T}$ — template mesh

  • $E_D$ — measures the scan-to-mesh distance from the scan to the aligned mesh $\mathbf{T}$

  • $E_C$ — constrains $\mathbf{T}$ to be close to the current statistical model by penalizing edge differences between $\mathbf{T}$ and the model $M(\vec{\beta}, \vec{\theta}, \vec{\psi})$ as

    $$
    E_C = \sum_e \lambda_e \|\mathbf{T}_e - M(\vec{\beta}, \vec{\theta}, \vec{\psi})_e\|
    $$

  • $\mathbf{T}_e, M(\vec{\beta}, \vec{\theta}, \vec{\psi})_e$ — edges of $\mathbf{T}$ and $M(\vec{\beta}, \vec{\theta}, \vec{\psi})$

  • $\lambda_e$ — an individual weight assigned to each edge

$$
E_R = \frac{1}{N} \sum_{k=1}^N \lambda_k \|U(\mathbf{v}_k)\|^2
$$

  • $E_R$ — regularization term for each vertex $\mathbf{v}_k \in \mathbb{R}^3$ of $\mathbf{T}$
  • $U(\mathbf{v}) = \frac{1}{|\mathcal{N}(\mathbf{v})|} \sum_{\mathbf{v}_r \in \mathcal{N}(\mathbf{v})} (\mathbf{v}_r - \mathbf{v})$
  • $\mathcal{N}(\mathbf{v})$ — the set of vertices in the one-ring neighborhood of $\mathbf{v}$
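The regularizer follows directly from the definitions above; in this sketch, `neighbors[k]` is assumed to list the one-ring vertex indices of vertex $k$:

```python
import numpy as np

def uniform_laplacian(vertices, neighbors):
    """U(v_k): mean of the one-ring neighbors minus the vertex
    itself (the uniform "umbrella" Laplacian)."""
    U = np.empty_like(vertices)
    for k, ring in enumerate(neighbors):
        U[k] = vertices[ring].mean(axis=0) - vertices[k]
    return U

def laplacian_energy(vertices, neighbors, lambdas):
    """E_R = (1/N) * sum_k lambda_k * ||U(v_k)||^2."""
    U = uniform_laplacian(vertices, neighbors)
    return float((lambdas * (U * U).sum(axis=1)).mean())
```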

Texture-based

$$
E(\mathbf{T}, \vec{\beta}, \vec{\theta}, \vec{\psi}) = E_D + E_C + \lambda_T E_T + E_R + E_P
$$

  • $E_T$ — measures the photometric error between the real image $I$ and the rendered textured image $\hat{I}$ of $\mathbf{T}$ from all $V$ views

$$
E_T = \sum_{l=0}^3 \sum_{v=1}^V \left\| \Gamma\left(I_l^{(v)}\right) - \Gamma\left(\hat{I}_l^{(v)}\right) \right\|_F^2
$$

  • $\|\mathbf{X}\|_F$ — Frobenius norm of $\mathbf{X}$
  • $\Gamma$ — a ratio of Gaussian filters that helps minimize the influence of lighting changes between real and rendered images
  • $I_l^{(v)}$ — the image $I$ at resolution level $l$ from view $v$
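One plausible reading of $\Gamma$ is the ratio of two Gaussian blurs at different scales; smooth lighting variation affects both scales similarly and roughly cancels in the quotient, while local texture detail survives. The filter widths are not specified in the notes, so `sigma1`/`sigma2` below are placeholders:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ratio_of_gaussians(img, sigma1=1.0, sigma2=2.0, eps=1e-6):
    """Gamma(I): divide a fine Gaussian blur of the image by a
    coarse one; eps guards against division by zero."""
    fine = gaussian_filter(img, sigma1)
    coarse = gaussian_filter(img, sigma2)
    return fine / (coarse + eps)

def photometric_term(real, rendered):
    """Squared Frobenius norm of the filtered difference, for one
    view at one resolution level of E_T."""
    diff = ratio_of_gaussians(real) - ratio_of_gaussians(rendered)
    return float((diff ** 2).sum())
```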

4.3. Sequential registration

Personalization

  • use a coupled registration (Equation $\eqref{eq:9}$) and average the results $\mathbf{T}_i$ across multiple sequences to get a personalized template for each subject
  • randomly select one of the $\mathbf{T}$ for each subject to generate a personalized texture map

Sequence fitting

  • replace the generic model template $\overline{\mathbf{T}}$ in $M$ (Equation $\eqref{eq:1}$) with the personalized template
  • fix $\vec{\beta}$ to zero
  • initialize the model parameters from the previous frame and use the single-frame registration of Section 4.2
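The sequence-fitting steps above amount to warm-starting each frame from the previous one; a schematic sketch, with `fit_single_frame` standing in for the full Section 4.2 optimization:

```python
def register_sequence(frames, fit_single_frame, theta_init, psi_init):
    """Fit frames in order, initializing each frame's pose and
    expression from the previous result; with a personalized
    template, beta stays fixed at zero and is not optimized."""
    theta, psi = theta_init, psi_init
    results = []
    for scan in frames:
        theta, psi = fit_single_frame(scan, theta, psi)
        results.append((theta, psi))
    return results
```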

6. Model training

decouple shape, pose, and expression variations

  • $\{\mathcal{P}, \mathcal{W}, \mathcal{T}\}$ — pose parameters
  • $\mathcal{E}$ — expression parameters
  • $\{\overline{\mathbf{T}}, \mathcal{S}\}$ — shape parameters

6.1. Pose parameter training

  • $\mathbf{T}_i^P$ — personalized rest-pose templates
  • $\mathbf{J}_i^P$ — person-specific joints
  • $\mathcal{W}$ — blendweights
  • $\mathcal{P}$ — pose blendshapes
  • $\mathcal{T}$ — joint regressor

alternate between:

  • solve for the pose parameters $\vec{\theta}_j$ of each registration $j$
  • optimize the subject-specific parameters $\{\mathbf{T}_i^P, \mathbf{J}_i^P\}$
  • optimize the global parameters $\{\mathcal{W}, \mathcal{P}, \mathcal{T}\}$

objective function being optimized consists of:

  • data term $E_D$ — penalizes the squared Euclidean reconstruction error of the training data
  • regularization term $E_{\mathcal{P}}$ — penalizes the Frobenius norm of the pose blendshapes
  • regularization term $E_{\mathcal{W}}$ — penalizes large deviations of the blendweights from their initialization

To avoid $\mathbf{T}_i^P$ and $\mathbf{J}_i^P$ being affected by strong facial expressions, expression effects are removed when solving for them. This is done by jointly solving for pose $\vec{\theta}$ and expression parameters $\vec{\psi}$ for each registration, subtracting $B_E$ (Equation $\eqref{eq:5}$), and solving for $\mathbf{T}_i^P$ and $\mathbf{J}_i^P$ on those residuals.

6.2. Expression parameter training

  • solve for the pose parameters $\vec{\theta}_j$ of each registration
  • unpose: remove the pose influence by applying the inverse of the transformation entailed by $M(\vec{0}, \vec{\theta}, \vec{0})$ (Equation $\eqref{eq:1}$)
  • $\mathbf{V}_j^U$ — the vertices resulting from unposing registration $j$
  • $\mathbf{V}_i^{NE}$ — the vertices of the neutral expression of subject $i$, also unposed
  • compute the expression residuals $\mathbf{V}_j^U - \mathbf{V}_{s(j)}^{NE}$ for each registration $j$
  • $s(j)$ — the subject index of registration $j$
  • compute the expression space $\mathcal{E}$ by applying PCA to these residuals
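The last step is a standard PCA of the stacked residuals; a sketch via SVD (mean-centering is the usual PCA convention and an assumption here, and the number of components retained is a modeling choice):

```python
import numpy as np

def expression_basis(residuals, n_components):
    """PCA on the unposed expression residuals V_j^U - V_{s(j)}^NE.
    residuals: (J, 3N) matrix, one flattened residual per registration.
    Returns the orthonormal expression basis E as (3N, n_components)."""
    X = residuals - residuals.mean(axis=0)           # mean-center
    # rows of Vt are the principal directions, largest variance first
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_components].T
```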

6.3. Shape parameter training

  • $\overline{\mathbf{T}}$ — computed as the mean of the expression- and pose-normalized registrations
  • $\mathcal{S}$ — formed by the first $|\vec{\beta}|$ principal components, computed using PCA

6.4. Optimization structure

Due to the high capacity and flexibility of the expression space formulation, pose blendshapes should be trained before expression parameters in order to avoid expression overfitting.