The method is split into two stages.
- stage1: Motion Prediction with Video Diffusion Models
  a. Video motion field prediction.
  b. Definition of the motion field: $\{f_{0\rightarrow i}\mid i=1,...,N\}$, where each $f_{0\rightarrow i}\in \mathbb{R}^{2\times H\times W}$ is the optical flow between the reference frame and frame $i$. Given the coordinates $p\in\mathbb{I}^2$ of a pixel in the reference frame, its location at each timestep is $p_i'=p+f_{0\rightarrow i}(p)$ (see the coordinate-mapping sketch after this stage's list).
  c. Model training, in three steps:
    ⅰ. First train an LDM that, conditioned on the image and text, predicts a single-frame displacement field;
    ⅱ. Freeze the LDM parameters, add temporal modules, and train only the temporal modules;
    ⅲ. Finally, train the full model (all parameters).
    ⅳ. The training data are optical flows and multi-frame trajectories predicted with FlowFormer++ and DOT.
    ⅴ. The optical flow is encoded with an optical flow VAE encoder whose structure mirrors the LDM autoencoder, except that the input and output are two-channel optical flow maps (a sketch of this 2-channel interface follows the stage1 list).
    ⅵ. In addition, the frame stride is used as a motion-strength signal: it is passed through an MLP and the result is added to the time embedding (a sketch of this embedding also follows the stage1 list).
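To make the motion-field definition in item b concrete, here is a minimal sketch (PyTorch, with illustrative names; the (x, y) channel order of the flow is an assumption) of mapping every reference-frame pixel through the predicted flows, i.e. $p_i'=p+f_{0\rightarrow i}(p)$:

```python
import torch

def displace_pixels(flows: torch.Tensor) -> torch.Tensor:
    """Map reference-frame pixel coordinates through predicted flows.

    flows: (N, 2, H, W) motion fields f_{0->i}; channel 0 = x offset, channel 1 = y offset
           (this channel convention is an assumption of the sketch).
    Returns: (N, 2, H, W) coordinates p_i' = p + f_{0->i}(p) for every pixel p.
    """
    n, _, h, w = flows.shape
    # Integer pixel grid p of the reference frame, stacked as (x, y) -> shape (2, H, W).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).to(flows.dtype)
    # p_i' = p + f_{0->i}(p), broadcast over the N predicted frames.
    return grid.unsqueeze(0) + flows

# Example: N = 8 flow fields for a 64x64 reference image.
coords = displace_pixels(torch.zeros(8, 2, 64, 64))
```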
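For item ⅴ, these notes only state the interface: an LDM-style autoencoder whose input and output are 2-channel optical flow maps. The sketch below illustrates that interface with placeholder depths, widths, and downsampling factor; it is not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class FlowAutoencoder(nn.Module):
    """LDM-style autoencoder whose input and output are 2-channel optical flow maps."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # Downsampling encoder / upsampling decoder; depths and widths are placeholders.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(256, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 256, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 2, 4, stride=2, padding=1),  # back to a 2-channel flow map
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(flow))

# A 2-channel 256x256 flow map is compressed 8x spatially and reconstructed.
recon = FlowAutoencoder()(torch.zeros(1, 2, 256, 256))
```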
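For item ⅵ, a minimal sketch of the frame-stride (motion strength) conditioning; the MLP layout and embedding width are assumptions, and only the idea of adding the MLP output to the diffusion time embedding comes from the method:

```python
import torch
import torch.nn as nn

class MotionStrengthEmbedding(nn.Module):
    """Map a scalar frame stride (motion strength) into the time-embedding space."""

    def __init__(self, embed_dim: int = 1280):
        super().__init__()
        # Small MLP; the hidden size equals the embedding width only for simplicity.
        self.mlp = nn.Sequential(
            nn.Linear(1, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, time_emb: torch.Tensor, frame_stride: torch.Tensor) -> torch.Tensor:
        # frame_stride: (B,) scalar per sample -> (B, 1) -> (B, embed_dim)
        strength_emb = self.mlp(frame_stride.float().unsqueeze(-1))
        # The motion-strength embedding is simply added to the diffusion time embedding.
        return time_emb + strength_emb

# Usage: add the stride embedding to an existing (B, 1280) timestep embedding.
emb = MotionStrengthEmbedding(1280)(torch.randn(4, 1280), torch.tensor([1, 2, 4, 8]))
```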
- stage2: Video Rendering with Predicted Motion
  a. Generate the video from the motion field predicted in stage1 and the reference image.
  b. Key novelty: motion-augmented temporal attention.
  c. Given the latent feature $z\in\mathbb{R}^{(1+N)\times C_l\times h_l\times w_l}$, with reference-frame latent $z[0]\in\mathbb{R}^{1\times C_l\times h_l\times w_l}$ and subsequent-frame latents $z[1:N]\in\mathbb{R}^{N\times C_l\times h_l\times w_l}$, the predicted motion fields $\{f_{0\rightarrow i}\mid i=1,...,N\}$ are used to warp the reference latent, $z[i]'=W(z[0],f_{0\rightarrow i})$, yielding $z_{avg}=[z[0],z[1]',z[1],...,z[N]',z[N]]\in\mathbb{R}^{(1+2\times N)\times C_l\times h_l\times w_l}$. Then $z$ and $z_{avg}$ are reshaped to $z'\in\mathbb{R}^{(h_l\times w_l)\times(1+N)\times C_l}$ and $z_{avg}'\in\mathbb{R}^{(h_l\times w_l)\times(1+2\times N)\times C_l}$, and a 1D temporal attention is applied with $Q=W^Qz'$, $K=W^Kz_{avg}'$, $V=W^Vz_{avg}'$ (see the attention sketch at the end of these notes).
- To support sparse trajectory control, an additional model is trained on top of stage1 using the ControlNet approach: the input sparse trajectories $f_{sparse}\in\mathbb{R}^{N\times 2\times H\times W}$ and the mask $m\in\{0,1\}^{H\times W}$ are concatenated and passed through a conv layer (a sketch of this input encoding also appears at the end).
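A minimal sketch of stage2's motion-augmented temporal attention under the shapes defined in item c. The single attention head, the module/function names, and the grid_sample-based warping operator are illustrative assumptions; the notes only specify how $z_{avg}$ is built and that queries come from $z'$ while keys/values come from $z_{avg}'$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_latent(z0: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp the reference latent z0 (C, h, w) with a flow field (2, h, w).

    The notes only say z[i]' = W(z[0], f_{0->i}); grid_sample-based warping is an
    illustrative choice, not necessarily the paper's exact operator.
    """
    c, h, w = z0.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Sampling locations p + f(p), normalized to [-1, 1] as grid_sample expects.
    x = (xs + flow[0]) / (w - 1) * 2 - 1
    y = (ys + flow[1]) / (h - 1) * 2 - 1
    grid = torch.stack([x, y], dim=-1).unsqueeze(0)               # (1, h, w, 2)
    return F.grid_sample(z0.unsqueeze(0), grid, align_corners=True)[0]

class MotionAugmentedTemporalAttention(nn.Module):
    """Queries from z'; keys/values from z_avg', which interleaves warped latents."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)

    def forward(self, z: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
        # z: (1+N, C, h, w) latent frames; flows: (N, 2, h, w) at latent resolution.
        n1, c, h, w = z.shape
        z_avg = [z[0]]
        for i in range(1, n1):
            z_avg += [warp_latent(z[0], flows[i - 1]), z[i]]      # [z0, z1', z1, ..., zN', zN]
        z_avg = torch.stack(z_avg, dim=0)                          # (1+2N, C, h, w)
        # Reshape so attention runs over the temporal axis, independently per pixel.
        zq = z.permute(2, 3, 0, 1).reshape(h * w, n1, c)           # (h*w, 1+N, C)
        zkv = z_avg.permute(2, 3, 0, 1).reshape(h * w, -1, c)      # (h*w, 1+2N, C)
        q, k, v = self.to_q(zq), self.to_k(zkv), self.to_v(zkv)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        return (attn @ v).reshape(h, w, n1, c).permute(2, 3, 0, 1)  # back to (1+N, C, h, w)

# Example: 1 reference + 8 frames, C_l = 320, latent size 32x32.
out = MotionAugmentedTemporalAttention(320)(torch.randn(9, 320, 32, 32), torch.randn(8, 2, 32, 32))
```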
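For the sparse trajectory control branch, the notes only say that $f_{sparse}$ and the mask $m$ are concatenated and passed through a conv before the ControlNet-style branch; the sketch below shows that input encoding with a hypothetical conv configuration and output width:

```python
import torch
import torch.nn as nn

class SparseTrajectoryEncoder(nn.Module):
    """Concatenate sparse trajectories with their mask and embed them with a conv."""

    def __init__(self, out_channels: int = 320):
        super().__init__()
        # 2 flow channels + 1 mask channel; kernel size and width are illustrative choices.
        self.conv = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)

    def forward(self, f_sparse: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # f_sparse: (N, 2, H, W) sparse motion vectors; mask: (H, W) with values in {0, 1}.
        n = f_sparse.shape[0]
        m = mask[None, None].expand(n, 1, -1, -1).to(f_sparse.dtype)   # (N, 1, H, W)
        x = torch.cat([f_sparse, m], dim=1)                            # (N, 3, H, W)
        return self.conv(x)                                            # condition features for the ControlNet-style branch

# Example: 8 frames of sparse flow on a 256x256 canvas.
feat = SparseTrajectoryEncoder()(torch.zeros(8, 2, 256, 256), torch.zeros(256, 256))
```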