Paper Reading: AU Intensity Estimation via Semantic Correspondence Learning with Dynamic Graph Convolution
Summary
A new learning framework that automatically learns the latent relationships of AUs by establishing semantic correspondences between feature maps.
- Heatmap regression-based network: feature maps preserve rich semantic information associated with AU intensities and locations.
- GCNN: describes the intrinsic relationship between various vertex nodes of the graph by learning an adjacency matrix, to explore the relationships among multiple feature maps.
- Semantic correspondence convolution module (SCC): automatically learns the semantic relationships among feature maps to discover the latent co-occurrence relationships of AU intensities.
Key Contributions:
- leverage the semantic correspondence for modeling the implicit co-occurrence relations of AU intensity levels in a heatmap regression framework, where the feature maps encode rich semantic descriptions and spatial distributions of AUs.
- SCC module to dynamically compute the correspondences among feature maps layer by layer.
Heatmap Regression
- In Stream 1:
Each deconvolutional layer is followed by an SCC module that models the relationship among multiple feature maps at this specific resolution level.
- In Stream 2:
The ground-truth possibility heatmap $g_i(x)$ for a predefined AU location $L_i$ $(i = 1, ..., N)$ is generated by applying a Gaussian function centered on its corresponding coordinate $\hat{x}_i$,
$g_i(x)=\frac{I}{2\pi\sigma^2}\exp\left(-\frac{\|x-\hat{x}_i\|_2^2}{2\sigma^2}\right)$,
// $I$: the labeled intensity of the specific AU;
// $\sigma$: the standard deviation.
- Utilize the L2 distance to minimize the difference between $h_i(x;w,b)$ (the predicted heatmap) and $g_i(x)$, then calculate the MSE loss.
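The heatmap target and loss above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `gaussian_heatmap` and `mse_loss`, the 32×32 grid, and the (x, y) = (column, row) convention are not from the paper.

```python
import numpy as np

def gaussian_heatmap(center, intensity, size, sigma):
    """g_i(x) = I / (2*pi*sigma^2) * exp(-||x - x_hat_i||^2 / (2*sigma^2)).

    center    : (x, y) pixel coordinate of the AU location (assumed convention)
    intensity : labeled AU intensity I
    size      : side length of the square heatmap (hypothetical parameter)
    """
    ys, xs = np.mgrid[0:size, 0:size]          # row and column index grids
    cx, cy = center
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2  # squared L2 distance to the center
    scale = intensity / (2.0 * np.pi * sigma ** 2)
    return scale * np.exp(-sq_dist / (2.0 * sigma ** 2))

def mse_loss(pred, target):
    """Mean squared L2 difference between predicted and ground-truth heatmaps."""
    return float(np.mean((pred - target) ** 2))

# Example: an AU with intensity 3 located at x=20, y=12 on a 32x32 map.
g = gaussian_heatmap(center=(20, 12), intensity=3.0, size=32, sigma=2.0)
```

Because the labeled intensity scales the Gaussian, an AU with intensity 0 yields an all-zero target, so the peak height of the heatmap carries the intensity and the peak position carries the location.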
SCC: Semantic Correspondence Convolution
- Aiming to model the correlation among feature channels, where each channel encodes a specific visual pattern of AU. The feature channels with similar semantic patterns would be activated simultaneously when a specific co-occurrence pattern of AU intensities emerges.
- In SCC module:
– first construct the k-NN graph by grouping sets of closest feature maps to find different co-occurrence patterns;
– then apply the convolution operations on the edges that connect feature maps sharing similar semantic patterns to further exploit the edge information of the graph;
– afterwards, the aggregation function, i.e., MAX, is applied to summarize the most discriminative features for improving AU intensity estimation.
Graph Construction
- The feature maps set is denoted by $F=\{f_1,f_2,...,f_n\}\subseteq\mathbb{R}$, and the size of each feature map (channel) is given by M×M.
Rearrange the M×M feature map in a feature vector with the length of L=M×M.
Construct the graph G as the k-nearest neighbor (k-NN) graph of F, and each node represents a specific feature map.
- The edge feature is defined by
$e_{ij}=h_{\Theta}(f_i,f_j)$, where $h_{\Theta}:\mathbb{R}^L\times\mathbb{R}^L\rightarrow\mathbb{R}^{L'}$ is a nonlinear function with trainable parameters Θ.
Combine the global information encoded by $f_i$, with its local neighborhood characteristics, captured by $f_j-f_i$.
The edge feature function is formulated as, $e'_{ijk}=\mathrm{ReLU}(\phi_k\cdot f_i+\omega_k\cdot(f_j-f_i))$,
// Θ: $(\phi_1,...,\phi_K,\omega_1,...,\omega_K)$, where K is the number of filters.
- For each $f_i$, the k-NN graph is built by computing a pairwise distance matrix (calculated based on the Euclidean distance) and then taking the closest k feature maps.
Adopt a channel-wise aggregation function, i.e., MAX, to summarize the edge features, as it can capture the most salient features.
The output of the SCC module at the i-th vertex is then produced by, $f'_{ik}=\max\limits_{j:(i,j)\in E}e'_{ijk}$.
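The three steps of the SCC module (k-NN graph, edge features, MAX aggregation) can be sketched as follows. This is a minimal NumPy forward pass written from the formulas above, not the authors' implementation. The function names `knn_indices` and `scc_layer` are hypothetical, the filters are stored as matrices `Phi` and `Omega` of shape (K, L), and excluding each feature map from its own neighbor list is an assumption.

```python
import numpy as np

def knn_indices(F, k):
    """Return, for each of the n feature vectors in F (n, L), the indices of
    its k nearest other vectors under the Euclidean distance."""
    F = np.asarray(F, dtype=float)
    sq_norms = np.sum(F ** 2, axis=1)
    # Pairwise squared distances: ||a||^2 + ||b||^2 - 2 a.b
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (F @ F.T)
    np.fill_diagonal(dist, np.inf)  # assumption: a node is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def scc_layer(F, Phi, Omega, k):
    """One SCC forward pass.

    F     : (n, L) flattened feature maps f_i
    Phi   : (K, L) filters phi_k applied to f_i (global term)
    Omega : (K, L) filters omega_k applied to f_j - f_i (local term)
    Returns (n, K) with f'_ik = max_j ReLU(phi_k . f_i + omega_k . (f_j - f_i)).
    """
    F = np.asarray(F, dtype=float)
    idx = knn_indices(F, k)                 # (n, k) neighbor indices
    diff = F[idx] - F[:, None, :]           # (n, k, L): f_j - f_i per edge
    global_term = F @ Phi.T                 # (n, K)
    local_term = diff @ Omega.T             # (n, k, K)
    edges = np.maximum(global_term[:, None, :] + local_term, 0.0)  # ReLU
    return edges.max(axis=1)                # MAX over the k neighbors

# Example with random values: 6 feature maps of length L=16, K=8 filters.
rng = np.random.default_rng(0)
F = rng.standard_normal((6, 16))
out = scc_layer(F, rng.standard_normal((8, 16)), rng.standard_normal((8, 16)), k=3)
```

Calling `scc_layer` again on `out` recomputes the neighbor indices from the new features, which is what makes the graph dynamic from layer to layer.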
Dynamic Graph Update
The dynamic graph convolutions are performed on both low and high resolution feature maps, aiming to capture the high-order AU interactions.
The SCC module can be integrated into multiple convolutional layers, where it learns to semantically group similar feature channels that would be activated together for a specific co-occurrence pattern of AU intensities.
Correspondence with AU Heatmaps
The predicted heatmap $h_i$ for AU-i is computed as, $h_i=F^L\otimes W_i^L$,
// $\otimes$: the tensor product;
// $F^L=\{f_1^L,...,f_C^L\}$: the feature maps set generated from the last SCC layer;
// $W_i^L=\{w_{1i}^L,...,w_{Ci}^L\}$, (i=1, 2, …, N): the 1×1 filter bank for a specific AU-i.
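One way to read the formula above is as a 1×1 convolution: each AU heatmap is a weighted sum of the C feature maps from the last SCC layer. The sketch below assumes that reading, with bias omitted. The name `predict_heatmaps` and the layout of `W` as a (C, N) array are illustrative choices, not taken from the paper.

```python
import numpy as np

def predict_heatmaps(F_last, W):
    """Apply a 1x1 filter bank to the last-layer feature maps.

    F_last : (C, M, M) feature maps f_1^L ... f_C^L
    W      : (C, N) weights, column i holds w_1i^L ... w_Ci^L for AU-i
    Returns (N, M, M) where h_i = sum_c W[c, i] * F_last[c].
    """
    return np.einsum("cn,chw->nhw", W, F_last)

# Example: C=3 feature maps of size 2x2 and N=2 AUs.
F_last = np.arange(12, dtype=float).reshape(3, 2, 2)
W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
h = predict_heatmaps(F_last, W)  # h[0] copies map 0, h[1] doubles map 1
```

Under this reading, the weights $w_{ci}^L$ indicate how strongly feature channel c contributes to the heatmap of AU-i.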